Gene expression signature for classification of tissue of origin of tumor samples

ABSTRACT

The present invention provides a process for classification of cancers and tissues of origin through the analysis of the expression patterns of specific microRNAs and nucleic acid molecules relating thereto. Classification according to a microRNA tree-based expression framework allows optimization of treatment, and determination of specific therapy.

FIELD OF THE INVENTION

The present invention relates to methods for classification of cancers and the identification of their tissue of origin. Specifically the invention relates to microRNA molecules associated with specific cancers, as well as various nucleic acid molecules relating thereto or derived therefrom.

BACKGROUND OF THE INVENTION

microRNAs (miRs, miRNAs) are a novel class of non-coding, regulatory RNA genes¹⁻³ which are involved in oncogenesis⁴ and show remarkable tissue-specificity⁵⁻⁷. They have emerged as highly tissue-specific biomarkers²,⁵,⁶ postulated to play important roles in encoding developmental decisions of differentiation. Various studies have tied microRNAs to the development of specific malignancies⁴. MicroRNAs are also stable in tissue, stored frozen or as formalin-fixed, paraffin-embedded (FFPE) samples, and in serum.

Hundreds of thousands of patients in the U.S. are diagnosed each year with a cancer that has already metastasized, without a clearly identified primary site. Oncologists and pathologists are constantly faced with a diagnostic dilemma when trying to identify the primary origin of a patient's metastasis. As metastases need to be treated according to their primary origin, accurate identification of the metastases' primary origin can be critical for determining appropriate treatment.

Once a metastatic tumor is found, the patient may undergo a wide range of costly, time consuming, and at times inefficient tests, including physical examination of the patient, histopathology analysis of the biopsy, imaging methods such as chest X-ray, CT and PET scans, in order to identify the primary origin of the metastasis.

Metastatic cancer of unknown primary (CUP) accounts for 3-5% of all new cancer cases, and as a group is usually a very aggressive disease with a poor prognosis¹⁰. The concept of CUP comes from the limitation of present methods to identify cancer origin, despite an often complicated and costly process which can significantly delay proper treatment of such patients. Recent studies revealed a high degree of variation in clinical management, in the absence of evidence based treatment for CUP¹¹. Many protocols were evaluated¹² but have shown relatively small benefit¹³. Determining tumor tissue of origin is thus an important clinical application of molecular diagnostics⁹.

Molecular classification studies for tumor tissue origin¹⁴⁻¹⁷ have generally used classification algorithms that did not utilize domain-specific knowledge: tissues were treated as a-priori equivalents, ignoring underlying similarities between tissue types with a common developmental origin in embryogenesis. An exception of note is the study by Shedden and co-workers¹⁸, which was based on a pathology classification tree. These studies used machine-learning methods that average effects of biological features (e.g., mRNA expression levels), an approach which is more amenable to automated processing but does not use or generate mechanistic insights.

Various markers have been proposed to indicate specific types of cancers and tumor tissue of origin. However, the diagnostic accuracy of tumor markers has not yet been defined. There is thus a need for a more efficient and effective method for diagnosing and classifying specific types of cancers.

SUMMARY OF THE INVENTION

The present invention provides specific nucleic acid sequences for use in the identification, classification and diagnosis of specific cancers and tumor tissue of origin. The nucleic acid sequences can also be used as prognostic markers for prognostic evaluation and determination of appropriate treatment of a subject based on the abundance of the nucleic acid sequences in a biological sample. The present invention further provides a method for accurate identification of tumor tissue origin.

The invention is based in part on the development of a microRNA-based classifier for tumor classification. microRNA expression levels were measured in 903 paraffin-embedded samples from 26 different tumor classes, corresponding to 18 distinct tissues and organs, including primary and metastatic tumors. microRNA microarray data of the samples, as well as qRT-PCR data, were used to construct a classifier based on 48 tissue-specific microRNAs, each linked to specific differential-diagnosis roles.

The overall sensitivity of the independent blinded test in identifying the tumor tissue of origin is 84%, with 97% specificity. High confidence predictions reach 90% sensitivity with 99% specificity.
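For orientation, sensitivity and specificity for a single tumor class can be computed one-vs-rest from true and predicted labels. The following is a minimal sketch only; the function name and the toy labels are hypothetical and are not data from the invention.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    true_labels: Sequence[str], predicted: Sequence[str], cls: str
) -> Tuple[float, float]:
    """One-vs-rest sensitivity and specificity for the class `cls`."""
    if len(true_labels) != len(predicted):
        raise ValueError("label sequences must have equal length")
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted):
        if truth == cls:
            if pred == cls:
                tp += 1  # class present and called
            else:
                fn += 1  # class present but missed
        else:
            if pred == cls:
                fp += 1  # class absent but called
            else:
                tn += 1  # class absent and not called
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy labels, for illustration only.
truth = ["liver", "liver", "lung", "colon", "lung"]
calls = ["liver", "lung", "lung", "colon", "lung"]
sens, spec = sensitivity_specificity(truth, calls, "liver")
```

In this toy example one of the two liver samples is missed, so the sensitivity for "liver" is 0.5, while no non-liver sample is called liver, so the specificity is 1.0.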

The findings demonstrate the utility of microRNA as novel biomarkers for the tissue of origin of a metastatic tumor. The classifier has wide biological as well as diagnostic applications. According to a first aspect, the present invention provides a method of identifying a tissue of origin of a biological sample, the method comprising: obtaining a biological sample from a subject; determining an expression profile of individual nucleic acids for a predetermined set of microRNAs; and classifying the tissue of origin for said sample by a classifier. According to one embodiment, said classifier is a decision tree model.

According to another aspect, the present invention provides a method of classifying a tissue of origin of a biological sample, the method comprising: obtaining a biological sample from a subject; determining an expression profile in said sample of nucleic acid sequences selected from the group consisting of SEQ ID NOS: 1-49, or a sequence having at least about 80% identity thereto; and comparing said expression profile to a reference expression profile by using a classifier algorithm; whereby the expression of any of said nucleic acid sequences or combinations thereof allows the identification of the tissue of origin of said sample.

According to one embodiment, said classifier algorithm is a decision tree classifier, logistic regression classifier, linear regression classifier, nearest neighbor classifier (including K nearest neighbors), neural network classifier, Gaussian mixture model (GMM) classifier, Support Vector Machine (SVM) classifier, nearest centroid classifier, random forest classifier or any boosting or bootstrap aggregating (bagging) of those classifiers.

According to certain embodiments, said tissue is selected from the group consisting of liver, lung, bladder, prostate, breast, colon, ovary, testis, stomach, thyroid, pancreas, brain, head and neck, kidney, melanocytes, thymus, biliary tract and esophagus.

According to some embodiments said biological sample is a cancerous sample.

According to another aspect, the present invention provides a method of classifying a cancer, the method comprising: obtaining a biological sample from a subject; measuring the relative abundance in said sample of nucleic acid sequences selected from the group consisting of SEQ ID NOS: 1-49 or a sequence having at least about 80% identity thereto; and comparing said obtained measurement to reference values representing abundance of said nucleic acid sequences by using a classifier algorithm; whereby the relative abundance of said nucleic acid sequences allows the classification of said cancer.

According to some embodiments, said reference values are predetermined thresholds.

According to one embodiment, said sample is obtained from a subject with a metastatic cancer. According to another embodiment, said sample is obtained from a subject with cancer of unknown primary (CUP). According to a further embodiment, said sample is obtained from a subject with a primary cancer. According to still another embodiment, said sample is a tumor of unidentified origin, a metastatic tumor or a primary tumor.

According to certain embodiments, said cancer is selected from the group consisting of liver cancer, biliary tract cancer, lung cancer, bladder cancer, prostate cancer, breast cancer, colon cancer, ovarian cancer, testicular cancer, stomach cancer, thyroid cancer, pancreas cancer, brain cancer, head and neck cancer, kidney cancer, melanoma, thymus cancer and esophagus cancer.

According to some embodiments, said lung cancer is selected from the group consisting of lung carcinoid, lung small cell carcinoma, lung adenocarcinoma, and lung squamous cell carcinoma.

According to some embodiments, said brain cancer is selected from the group consisting of brain astrocytoma and brain oligodendroglioma.

According to some embodiments, said thyroid cancer is selected from the group consisting of thyroid follicular, thyroid papillary and thyroid medullary cancer.

According to some embodiments, said ovarian cancer is selected from the group consisting of ovarian endometrioid and ovarian serous cancer.

According to some embodiments, said testicular cancer is selected from the group consisting of testicular non-seminoma and testicular seminoma.

According to some embodiments, said esophagus cancer is selected from the group consisting of esophagus adenocarcinoma and esophagus squamous cell carcinoma.

According to some embodiments, said head and neck cancer is selected from the group consisting of larynx carcinoma, pharynx carcinoma and nose carcinoma.

According to some embodiments, said biliary tract cancer is selected from the group consisting of cholangiocarcinoma and gallbladder adenocarcinoma.

According to other embodiments, said biological sample is selected from the group consisting of bodily fluid, a cell line, a tissue sample, a biopsy sample, a needle biopsy sample, a surgically removed sample, and a sample obtained by tissue-sampling procedures. According to some embodiments the biological sample is a fine needle aspiration (FNA) sample. According to some embodiments, said tissue is a fresh, frozen, fixed, wax-embedded or formalin-fixed paraffin-embedded (FFPE) tissue.

The classification method of the present invention comprises the use of at least one classifier algorithm, said classifier algorithm being selected from the group consisting of decision tree classifier, logistic regression classifier, linear regression classifier, nearest neighbor classifier (including K nearest neighbors), neural network classifier, Gaussian mixture model (GMM) classifier, Support Vector Machine (SVM) classifier, nearest centroid classifier, random forest classifier and any boosting or bootstrap aggregating (bagging) of those classifiers.

The classifier may use a decision tree structure (including binary tree) or a voting (including weighted voting) scheme to compare the classification of one or more classifier algorithms in order to reach a unified or majority decision.
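A voting scheme of the kind mentioned above can be sketched as follows. This is an illustrative sketch only: the function name, the example labels and the weights are hypothetical, and the source does not specify how weights would be chosen or how ties would be resolved (here the label seen first among tied labels wins).

```python
from collections import defaultdict
from typing import Optional, Sequence


def weighted_vote(
    predictions: Sequence[str], weights: Optional[Sequence[float]] = None
) -> str:
    """Combine one predicted label per classifier into a single decision.

    With weights=None every classifier counts equally (plain majority vote).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight per prediction is required")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # max() returns the first label reaching the highest total.
    return max(totals, key=totals.get)


# Hypothetical outputs of three classifiers for one sample.
majority = weighted_vote(["liver", "lung", "lung"])
weighted = weighted_vote(["liver", "lung", "lung"], [3.0, 1.0, 1.0])
```

With equal weights the majority label "lung" is returned; giving the first classifier a weight of 3.0 lets its call of "liver" prevail.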

The invention further provides a method for classifying a cancer of liver origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 6, 9, 25, 26, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of liver origin.

The invention further provides a method for classifying a cancer of testicular origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 6, 26, 41, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of testicular origin.

The invention further provides a method for classifying a cancer of testicular seminoma origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 6, 26, 31, 41, 45, 48 or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of testicular seminoma origin.

The invention further provides a method for classifying a cancer of melanoma origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 6, 15, 17, 26, 41, 46, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of melanoma origin.

The invention further provides a method for classifying a cancer of kidney origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 6, 7, 15, 17, 26, 41, 46, 47, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of kidney origin.

The invention further provides a method for classifying a cancer of brain origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 6, 7, 15, 17, 26, 41, 46, 47, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of brain origin.

The invention further provides a method for classifying a cancer of brain astrocytoma origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 6, 7, 10, 15, 17, 26, 41, 46, 47, or a sequence having at least about 80% identity thereto in said sample; wherein the abundance of said nucleic acid sequence is indicative of a cancer of brain astrocytoma origin.

The invention further provides a method for classifying a cancer of brain oligodendroglioma origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 6, 7, 10, 15, 17, 26, 41, 46, 47, or a sequence having at least about 80% identity thereto in said sample; wherein the abundance of said nucleic acid sequence is indicative of a cancer of brain oligodendroglioma origin.

The invention further provides a method for classifying a cancer of thyroid medullary origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 6, 17-19, 24, 26, 32, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of thyroid medullary origin.

The invention further provides a method for classifying a cancer of lung carcinoid origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3, 6, 17-19, 24, 26, 32, 36, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of lung carcinoid origin.

The invention further provides a method for classifying a cancer of lung small cell carcinoma origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3, 6, 17-19, 24, 26, 32, 36, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of lung small cell carcinoma origin.

The invention further provides a method for classifying a cancer of colon origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 1, 3, 4, 6, 17-19, 21, 26, 29, 34, 37, 41, 42, 48, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of colon origin.

The invention further provides a method for classifying a cancer of stomach origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 1, 3, 4, 6, 17-19, 21, 26, 29, 34, 37, 41, 42, 48, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of stomach origin.

The invention further provides a method for classifying a cancer of pancreas origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 1, 3, 6, 17-19, 21, 26, 28, 29, 33, 37, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of pancreas origin.

The invention further provides a method for classifying a cancer of biliary tract origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 1, 3, 6, 9, 17-19, 21, 25, 26, 28, 29, 33, 37, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of biliary tract origin.

The invention further provides a method for classifying a cancer of prostate origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3, 6, 17-21, 26, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of prostate origin.

The invention further provides a method for classifying a cancer of ovarian origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3, 5, 6, 11, 17-21, 26, 30, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of ovarian origin.

The invention further provides a method for classifying a cancer of ovarian endometrioid origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 2, 3, 5, 6, 11, 17-22, 26, 30, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of ovarian endometrioid origin.

The invention further provides a method for classifying a cancer of ovarian serous origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 2, 3, 5, 6, 11, 17-22, 26, 30, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of ovarian serous origin.

The invention further provides a method for classifying a cancer of breast origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3, 5, 6, 11, 17-22, 26, 30, 39, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of breast origin.

The invention further provides a method for classifying a cancer of lung adenocarcinoma origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3, 5, 6, 8, 11, 16-22, 26, 27, 30, 37, 39, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of lung adenocarcinoma origin.

The invention further provides a method for classifying a cancer of papillary thyroid origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3, 5, 6, 8, 11, 16-22, 26, 27, 29, 30, 37-39, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of papillary thyroid origin.

The invention further provides a method for classifying a cancer of follicular thyroid origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3, 5, 6, 8, 11, 16-22, 26, 27, 29, 30, 37-39, 41, 42, or a sequence having at least about 80% identity thereto in said sample; wherein the abundance of said nucleic acid sequence is indicative of a cancer of follicular thyroid origin.

The invention further provides a method for classifying a cancer of thymus origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3, 5, 6, 11, 16-22, 26, 27, 29, 30, 35, 39, 41, 42, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of thymus origin.

The invention further provides a method for classifying a cancer of bladder origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3-6, 11, 16-22, 26, 27, 29, 30, 35, 39, 41, 42, 44, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of bladder origin.

The invention further provides a method for classifying a cancer of lung squamous origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3-6, 11, 16-23, 26, 27, 29, 30, 32, 35, 39, 41, 42, 44, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of lung squamous origin.

The invention further provides a method for classifying a cancer of head and neck origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3-6, 11, 14, 16-23, 26, 27, 29, 30, 32, 35, 37, 39, 41, 42, 44, 45, or a sequence having at least about 80% identity thereto in a sample obtained from a subject; wherein the abundance of said nucleic acid sequence is indicative of a cancer of head and neck origin.

The invention further provides a method for classifying a cancer of esophagus origin, the method comprising measuring the relative abundance of a nucleic acid sequence selected from the group consisting of SEQ ID NOS: 3-6, 11, 14, 16-23, 26, 27, 29, 30, 32, 35, 37, 39, 41, 42, 44, 45, or a sequence having at least about 80% identity thereto in said sample; wherein the abundance of said nucleic acid sequence is indicative of a cancer of esophagus origin.

According to some embodiments the nucleic acid sequence expression profile or relative abundance is determined by a method selected from the group consisting of nucleic acid hybridization and nucleic acid amplification. According to some embodiments the nucleic acid hybridization is performed using a solid-phase nucleic acid biochip array or in situ hybridization.

According to some embodiments the nucleic acid amplification method is real-time PCR. The real-time PCR method may comprise forward and reverse primers. According to some embodiments the forward primer comprises a sequence selected from the group consisting of SEQ ID NOS: 50-98 and 150. According to some embodiments the reverse primer comprises SEQ ID NO: 288.

According to additional embodiments the real-time PCR method further comprises a probe. According to some embodiments the probe comprises a sequence selected from the group consisting of a sequence that is complementary to a sequence selected from SEQ ID NOS: 1-49; a fragment thereof and a sequence having at least about 80% identity thereto. According to additional embodiments the probe comprises a sequence selected from the group consisting of SEQ ID NOS: 99-149 and 151.

According to another aspect, the present invention provides a kit for cancer classification, said kit comprising a probe comprising a sequence selected from the group consisting of a sequence that is complementary to a sequence selected from SEQ ID NOS: 1-49; a fragment thereof and a sequence having at least about 80% identity thereto.

According to additional embodiments the probe comprises a sequence selected from the group consisting of SEQ ID NOS: 99-149 and 151.

According to certain embodiments, said cancer is selected from the group consisting of liver cancer, biliary tract cancer, lung cancer, bladder cancer, prostate cancer, breast cancer, colon cancer, ovarian cancer, testicular cancer, stomach cancer, thyroid cancer, pancreas cancer, brain cancer, head and neck cancer, kidney cancer, melanoma, thymus cancer and esophagus cancer.

These and other embodiments of the present invention will become apparent in conjunction with the figures, description and claims that follow.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1C demonstrate the structure of the binary decision-tree classifier, with 26 nodes (numbered, Table 3) and 27 leaves. Each node is a binary decision between two sets of samples, those to the left and right of the node. A series of binary decisions, starting at node #1 and moving downwards, lead to one of the possible tumor types, which are the “leaves” of the tree. A sample which is classified to the left branch at node #1 continues to node #2, otherwise it continues to node #3. A sample that reaches node #2 is further classified to either the left branch at node #2, and is assigned to the “liver” class, or to the right branch at node #2, and is assigned to the “biliary tract carcinoma” class.

Decisions are made at consecutive nodes using microRNA expression levels, until an end-point (“leaf” of the tree) is reached, indicating the predicted class for this sample. In specifying the tree structure, clinico-pathological considerations were combined with properties observed in the training set data.
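The traversal described for FIGS. 1A-1C can be sketched as a walk over numbered nodes until a leaf label is reached. This is a minimal sketch, not the classifier of the invention: only nodes #1 and #2 are modeled, the decision rules and the feature name "marker_x" are hypothetical placeholders, and expression values are assumed to be on a scale where a larger number means higher expression.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Profile = Dict[str, float]  # microRNA name -> expression level


@dataclass
class Node:
    go_left: Callable[[Profile], bool]  # binary decision at this node
    left: Union[int, str]   # int = next node number, str = leaf label
    right: Union[int, str]


def classify(tree: Dict[int, Node], profile: Profile, start: int = 1) -> str:
    """Follow binary decisions from `start` until a leaf (a str) is reached."""
    current: Union[int, str] = start
    while isinstance(current, int):
        if current not in tree:
            # This toy tree is incomplete; report where the walk stopped.
            return f"unresolved at node #{current}"
        node = tree[current]
        current = node.left if node.go_left(profile) else node.right
    return current


# Hypothetical rules; the actual per-node decision rules are not given here.
TOY_TREE = {
    1: Node(lambda p: p["hsa-miR-122"] > p["hsa-miR-200c"], left=2, right=3),
    2: Node(lambda p: p["marker_x"] > 0.0,
            left="liver", right="biliary tract carcinoma"),
}

sample = {"hsa-miR-122": 10.0, "hsa-miR-200c": 2.0, "marker_x": 1.0}
result = classify(TOY_TREE, sample)
```

Representing leaves as strings and internal nodes as integers keeps the walk to a single loop; a sample that goes right at node #1 stops at the unmodeled node #3 in this toy tree.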

Developing a different classifier for e.g. male and female cases or for different tumor sites would inefficiently exploit measured data and would require unwieldy numbers of samples. Instead, exceptions were noted for several special cases: For samples from female patients, testis or prostate origins were excluded from the KNN database, and the right branch was automatically taken in node 3 and node 16 in the decision-tree. For samples from male patients, ovary origin was excluded and the right branch taken at node 17. For samples that were indicated as metastases to the liver, liver origin (hepatocellular carcinoma and biliary tract carcinomas from within the liver) was excluded and the right branch taken at node 1. For samples indicated as brain metastases, brain origin was excluded and the right branch taken at node 7. Additional information is thus incorporated into the classification decision without loss of generality or need to retrain the classifier.
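The special cases above amount to a set of nodes at which the right branch is taken regardless of the measured expression levels. A minimal sketch follows; the node numbers come from the text above, while the function names and the string encodings of sex and metastasis site are hypothetical.

```python
from typing import Optional, Set


def forced_right_nodes(
    sex: Optional[str] = None, metastasis_site: Optional[str] = None
) -> Set[int]:
    """Nodes at which the right branch is taken automatically."""
    forced: Set[int] = set()
    if sex == "female":
        forced |= {3, 16}   # testis and prostate origins excluded
    elif sex == "male":
        forced.add(17)      # ovary origin excluded
    if metastasis_site == "liver":
        forced.add(1)       # liver origin excluded
    elif metastasis_site == "brain":
        forced.add(7)       # brain origin excluded
    return forced


def choose_branch(node_id: int, classifier_says_left: bool, forced: Set[int]) -> str:
    """Override the node's own decision when the node is in `forced`."""
    if node_id in forced:
        return "right"
    return "left" if classifier_says_left else "right"


forced = forced_right_nodes(sex="female", metastasis_site="liver")
branch_at_node_1 = choose_branch(1, classifier_says_left=True, forced=forced)
```

Because the override is applied at traversal time, the per-node decision rules stay unchanged, which matches the statement that the classifier need not be retrained.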

FIG. 2 demonstrates binary decisions at node #1 of the decision-tree. When training a decision algorithm for a given node, only samples from classes which are possible outcomes (“leaves”) of this node are used for training. Tumors originating from tissues at the left branch at node #1, including tumors from the “liver” class and the “biliary tract” class (liver-cholangio; diamonds) are easily separated from tumors of non-liver and non-biliary tract origins (right branch at node #1; gray squares) using the expression levels of hsa-miR-200c (SEQ ID NO: 26) and hsa-miR-122 (SEQ ID NO: 6) (with one outlier), with a linear classifier (the diagonal line).
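A linear classifier on two expression levels, such as the diagonal line in FIG. 2, assigns a sample to one side of a line in the plane of the two markers. The sketch below is illustrative only: the weights and bias are hypothetical and are not the fitted values of the invention, and both inputs are assumed to be on a scale where a larger number means higher expression.

```python
def linear_branch(
    mir_122: float, mir_200c: float,
    w_122: float = 1.0, w_200c: float = -1.0, bias: float = 0.0,
) -> str:
    """Return 'left' or 'right' depending on the side of the line
    w_122 * mir_122 + w_200c * mir_200c + bias = 0 on which the sample falls.
    """
    score = w_122 * mir_122 + w_200c * mir_200c + bias
    return "left" if score > 0 else "right"


# Hypothetical samples: high hsa-miR-122 relative to hsa-miR-200c goes left.
hepatic_like = linear_branch(mir_122=12.0, mir_200c=4.0)
epithelial_like = linear_branch(mir_122=3.0, mir_200c=11.0)
```

With the default placeholder weights the decision line is the diagonal where both levels are equal; a fitted classifier would instead learn the weights and bias from the training samples of the classes reachable from that node.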

FIG. 3 demonstrates binary decisions at node #5 of the decision-tree. Tumors of epithelial origin (left branch at node #5, marked by diamonds) are easily separated from tumors of non-epithelial origin (right branch at node #5, marked by squares) using the expression levels of hsa-miR-200c (SEQ ID NO: 26) and hsa-miR-148b (SEQ ID NO: 17). The gray area (with higher levels of hsa-miR-200c) marks the region classified as epithelial (left branch) at this node.

FIG. 4 demonstrates binary decisions at node #7 of the decision-tree. Tumors originating in the brain (diamonds) are easily separated from tumors of kidney origin (squares) using the expression levels of hsa-miR-124 (SEQ ID NO: 7) and hsa-miR-9* (SEQ ID NO: 47).

FIG. 5 demonstrates binary decisions at node #10 of the decision-tree. Neuroendocrine tumors originating in the lung (diamonds) are easily separated from tumors of thyroid-medullary origin (squares) using the expression levels of hsa-miR-200a (SEQ ID NO: 24) and hsa-miR-222 (SEQ ID NO: 32).

FIG. 6 demonstrates binary decisions at node #12 of the decision-tree. Tumors originating in the gastrointestinal tract (left branch at node #12, marked by diamonds) are easily separated from tumors of non-digestive origins (right branch at node #12, marked by squares) using the expression levels of hsa-miR-106a (SEQ ID NO: 3) and hsa-miR-192 (SEQ ID NO: 21).

FIG. 7 demonstrates binary decisions at node #16 of the decision-tree. Tumors originating in the prostate (left branch at node #16, marked by diamonds) are easily separated from tumors of other origins (right branch at node #16, marked by squares) using the expression levels of hsa-miR-185 (SEQ ID NO: 20) and hsa-miR-375 (SEQ ID NO: 42).

FIGS. 8A-8B demonstrate a classification example. FIG. 8A compares the measured levels (normalized C_t, inversely proportional to log(abundance)) of hsa-miR-200c (SEQ ID NO: 26) and hsa-miR-122 (SEQ ID NO: 6) for all training set samples, indicating the left and right branches of node #1 (circles and stars respectively). One metastatic tumor excised from the brain (square) came from a patient who had a concomitant tumor in the lung, and was therefore originally diagnosed as a lung cancer. However, this sample showed an uncharacteristically high expression of hsa-miR-122, a strong hepatic marker, and was consequently classified as possibly originating from the liver by the microRNA classifier. FIG. 8B shows that upon re-examination of the metastatic brain tumor by immunohistochemistry (blinded to the results of the microRNA classifier), this tumor was indeed found to be negative for lung specific markers: the sample was negative for immunohistochemical staining by both CK7 and TTF1, as well as CK20, CEA, CA125, s-100, thyroglobulin, chromogranin, synaptophysin, CD56, GFAP, calcitonin, and anterior pituitary hormones, while staining positive for CAM5.5′ and AE1/AE3. This staining pattern was compatible with hepatocellular carcinoma, prompting further staining for HEPA1 and alpha fetoprotein. The tumor stained positive for both stains, consistent with a diagnosis of hepatocellular carcinoma (FIG. 8B). H&E staining (upper panel) showed that the metastasis is composed of sheets of cells with abundant eosinophilic cytoplasm and round to oval nuclei. Among many immunostains used to evaluate the origin of the tumor, HEPA-1 showed strong and specific immunopositivity (lower panel).
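The relation between a normalized C_t value and abundance noted for FIG. 8A can be illustrated numerically. The sketch below assumes ideal amplification (product doubling every cycle), so that a C_t lower by one cycle corresponds to a twofold higher abundance; the function name and the example values are hypothetical, and the normalization actually used is not specified in this passage.

```python
def relative_abundance(ct: float, reference_ct: float) -> float:
    """Fold abundance relative to a reference, assuming doubling per cycle.

    A lower C_t means the signal crossed the threshold earlier, i.e. the
    target was more abundant, hence the negative sign in the exponent.
    """
    return 2.0 ** -(ct - reference_ct)


# Hypothetical values: a sample crossing threshold 3 cycles earlier
# than the reference is taken to be 8-fold more abundant.
fold = relative_abundance(ct=25.0, reference_ct=28.0)
```

Under this assumption C_t falls linearly as log2(abundance) rises, which is why a strongly expressed marker such as hsa-miR-122 in a hepatic sample appears at a low C_t.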

DETAILED DESCRIPTION OF THE INVENTION

Identification of the tissue-of-origin of a tumor is vital to its management. The present invention is based in part on the discovery that specific nucleic acid sequences can be used for the identification of the tissue-of-origin of a tumor. The present invention provides a sensitive, specific and accurate method which can be used to distinguish between different tissues and tumor origins. A new microRNA-based classifier was developed for determining tissue origin of tumors based on a surprisingly small number of 48 microRNA markers. The classifier uses a specific algorithm and allows a clear interpretation of the specific biomarkers. High confidence predictions reach 90% sensitivity and 99% specificity.

According to the present invention, each node in the classification tree may be used as an independent differential diagnosis tool, for example in the identification of different types of lung cancer. The performance of the classifier using a small number of markers highlights the utility of microRNA as tissue-specific cancer biomarkers, and provides an effective means for facilitating diagnosis of CUP and more generally of identifying tumor origins of metastases.

The ability to distinguish between different tumor origins facilitates providing the patient with the best and most suitable treatment.

The present invention provides diagnostic assays and methods, both quantitative and qualitative, for detecting, diagnosing, monitoring, staging and prognosticating cancers by comparing the levels of the specific microRNA molecules of the invention. Such levels are preferably measured in at least one of biopsies, tumor samples, fine-needle aspiration (FNA), cells, tissues and/or bodily fluids. The present invention provides methods for diagnosing the presence of a specific cancer by analyzing the levels of said microRNA molecules in biopsies, tumor samples, cells, tissues or bodily fluids.

In the present invention, determining the levels of said microRNA in biopsies, tumor samples, cells, tissues or bodily fluid is particularly useful for discriminating between different cancers.

All the methods of the present invention may optionally further include measuring levels of other cancer markers. Other cancer markers, in addition to said microRNA molecules, useful in the present invention will depend on the cancer being tested and are known to those of skill in the art.

Assay techniques that can be used to determine levels of gene expression, such as the nucleic acid sequence of the present invention, in a sample derived from a patient are well known to those of skill in the art. Such assay methods include, but are not limited to, reverse transcriptase PCR (RT-PCR) assays, nucleic acid microarrays and biochip analysis, immunohistochemistry assays, in situ hybridization assays, competitive-binding assays, northern blot analyses and ELISA assays.

According to one embodiment, the assay is based on the expression level of 48 microRNAs in RNA extracted from FFPE metastatic tumor tissue. The test is a quantitative real time reverse transcriptase polymerase chain reaction (qRT-PCR) test. RNA is first polyadenylated and then reverse transcribed using a universal poly(T) adapter to create cDNA. The cDNA is amplified using a specific forward primer and a universal reverse primer (with a sequence complementary to the 5′ tail of the poly(T) adapter), and detected by specific MGB probes (see specific sequences in Table 1).

The expression levels are used to infer the sample origin using analysis techniques such as, but not limited to, decision tree classifier, logistic regression classifier, linear regression classifier, nearest neighbor classifier (including K nearest neighbors), neural network classifier and nearest centroid classifier.

The expression levels are used to make binary decisions (at each relevant node) following the pre-defined structure of the binary decision-tree (defined using the training set). At each node, the expressions of one or several microRNAs are combined together using a simple function of the form P = exp(b0 + b1*mir1 + b2*mir2 + b3*mir3 . . . ), where the values of b0, b1, b2 . . . and the identities of the microRNAs have been pre-determined (using the training set). The resulting P is compared to a threshold level PTH (which was also determined using the training set), and the classification continues to the left or right branch according to whether P is larger or smaller than PTH for that node. This continues until an end-point (“leaf”) of the tree is reached.
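The node-by-node procedure described above may be sketched as follows. This is a minimal illustration only: the names `Node`, `node_score` and `classify_sample`, the toy coefficients, and the convention that P larger than PTH leads to the left branch are all assumptions made for the example and are not taken from the trained classifier.

```python
import math
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Node:
    """One decision point of a binary tree (hypothetical structure)."""
    b0: float                      # intercept b0
    weights: Dict[str, float]      # microRNA name -> coefficient b_i
    p_th: float                    # threshold PTH for this node
    left: Union["Node", str]       # child node, or a leaf label (str)
    right: Union["Node", str]


def node_score(node: Node, levels: Dict[str, float]) -> float:
    """P = exp(b0 + b1*mir1 + b2*mir2 + ...), the form given in the text."""
    return math.exp(node.b0 + sum(b * levels[m] for m, b in node.weights.items()))


def classify_sample(root: Node, levels: Dict[str, float]) -> str:
    """Walk the tree until a leaf (a plain string label) is reached."""
    node: Union[Node, str] = root
    while not isinstance(node, str):
        p = node_score(node, levels)
        # Assumption: P > PTH goes left, otherwise right.
        node = node.left if p > node.p_th else node.right
    return node


# Toy two-node tree with made-up coefficients and labels.
toy_tree = Node(
    b0=0.0, weights={"miR-A": 1.0}, p_th=1.0,
    left="class L",
    right=Node(b0=0.0, weights={"miR-B": -1.0}, p_th=1.0,
               left="class R1", right="class R2"),
)
```

In this sketch a leaf is represented by a string label, so a tree of any depth can be expressed by nesting `Node` objects.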

Training the tree algorithm means determining: the tree structure (which nodes there are and what is on each side), which miRs are used in each node, and the values of b0, b1, b2 . . . and PTH. These were determined by a combination of machine learning, optimization algorithms, and trial and error by experts in machine learning and diagnostic algorithms.

In some embodiments of the invention, correlations and/or hierarchical clustering can be used to assess the similarity of the expression level of the nucleic acid sequences of the invention between a specific sample and different exemplars of cancer samples. An arbitrary threshold on the expression level of one or more nucleic acid sequences can be set for assigning a sample or cancer sample to one of two groups. Alternatively, in a preferred embodiment, expression levels of one or more nucleic acid sequences of the invention are combined by a method such as logistic regression to define a metric which is then compared to previously measured samples or to a threshold. The threshold for assignment is treated as a parameter, which can be used to quantify the confidence with which samples are assigned to each class. The threshold for assignment can be scaled to favor sensitivity or specificity, depending on the clinical scenario. The correlation value to the reference data generates a continuous score that can be scaled and provides diagnostic information on the likelihood that a sample belongs to a certain class of cancer origin or type. In multivariate analysis, the microRNA signature provides a high level of prognostic information.

In another preferred embodiment, the expression level of the nucleic acids is used to classify a test sample by comparison to a training set of samples. In this embodiment, the test sample is compared in turn to each one of the training set samples. Each such pairwise comparison is performed by comparing the expression levels of one or multiple nucleic acids between the test sample and the specific training sample. Each such pairwise comparison generates a combined metric for the multiple nucleic acids, which can be calculated by various numeric methods such as correlation, cosine, Euclidean distance, mean square distance, or other methods known to those skilled in the art. The training samples are then ranked according to this metric, and the samples with the highest values of the metric (or lowest values, according to the type of metric) are identified, indicating those samples that are most similar to the test sample. By choosing a parameter K, this generates a list that includes the K training samples that are most similar to the test sample. Various methods can then be applied to identify from this list the predicted class of the test sample. In a favored embodiment, the test sample is predicted to belong to the class that has the highest number of representatives in the list of K most-similar training samples (this method is known as the K Nearest Neighbors method). Other embodiments may provide a list of predictions including all or part of the classes represented in the list, those classes that are represented more than a given minimum number of times, or other voting schemes whereby classes are grouped together.
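The K Nearest Neighbors embodiment described above may be illustrated by the following minimal sketch. Euclidean distance is used as the pairwise metric (one of the options listed), and the function names, the toy expression profiles and the class labels are assumptions made for the example. Ties in the vote are resolved in favor of the class whose member appears first in the ranked list, which is also an assumption.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(test: Sequence[float],
                training: List[Tuple[Sequence[float], str]],
                k: int) -> str:
    """Rank training samples by distance to the test sample and return the
    class with the most representatives among the K most similar samples."""
    ranked = sorted(training, key=lambda item: euclidean(test, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set: (expression profile, class label); values are made up.
toy_training = [
    ([1.0, 1.0], "liver"),
    ([1.2, 0.9], "liver"),
    ([5.0, 5.0], "lung"),
    ([5.1, 4.8], "lung"),
    ([4.9, 5.2], "lung"),
]
```

For a similarity metric such as correlation, where higher values mean greater similarity, the ranking would be reversed.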

Definitions

It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. It must be noted that, as used in the specification and the appended claims, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise.

For the recitation of numeric ranges herein, each intervening number there between with the same degree of precision is explicitly contemplated. For example, for the range of 6-9, the numbers 7 and 8 are contemplated in addition to 6 and 9, and for the range 6.0-7.0, the numbers 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9 and 7.0 are explicitly contemplated.

About

As used herein, the term “about” refers to +/−10%.

Attached

“Attached” or “immobilized”, as used herein to refer to a probe and a solid support, means that the binding between the probe and the solid support is sufficient to be stable under conditions of binding, washing, analysis, and removal. The binding may be covalent or non-covalent. Covalent bonds may be formed directly between the probe and the solid support or may be formed by a cross linker or by inclusion of a specific reactive group on either the solid support or the probe or both molecules. Non-covalent binding may be one or more of electrostatic, hydrophilic, and hydrophobic interactions. Included in non-covalent binding is the covalent attachment of a molecule, such as streptavidin, to the support and the non-covalent binding of a biotinylated probe to the streptavidin. Immobilization may also involve a combination of covalent and non-covalent interactions.

Baseline

“Baseline”, as used herein, means the initial cycles of PCR, in which there is little change in fluorescence signal.

Biological Sample

“Biological sample”, as used herein, means a sample of biological tissue or fluid that comprises nucleic acids. Such samples include, but are not limited to, tissue or fluid isolated from subjects. Biological samples may also include sections of tissues such as biopsy and autopsy samples, FFPE samples, frozen sections taken for histological purposes, blood, blood fraction, plasma, serum, sputum, stool, tears, mucus, hair, skin, urine, effusions, ascitic fluid, amniotic fluid, saliva, cerebrospinal fluid, cervical secretions, vaginal secretions, endometrial secretions, gastrointestinal secretions, bronchial secretions, cell line, tissue sample, or secretions from the breast. A biological sample may be provided by fine-needle aspiration (FNA). A biological sample may be provided by removing a sample of cells from a subject but can also be accomplished by using previously isolated cells (e.g., isolated by another person, at another time, and/or for another purpose), or by performing the methods described herein in vivo. Archival tissues, such as those having treatment or outcome history, may also be used. Biological samples also include explants and primary and/or transformed cell cultures derived from animal or human tissues.

Cancer

The term “cancer” is meant to include all types of cancerous growths or oncogenic processes, metastatic tissues or malignantly transformed cells, tissues, or organs, irrespective of histopathologic type or stage of invasiveness. Examples of cancers include, but are not limited to, solid tumors and leukemias, including: apudoma, choristoma, branchioma, malignant carcinoid syndrome, carcinoid heart disease, carcinoma (e.g., Walker, basal cell, basosquamous, Brown-Pearce, ductal, Ehrlich tumor, non-small cell lung (e.g., lung squamous cell carcinoma, lung adenocarcinoma and lung undifferentiated large cell carcinoma), oat cell, papillary, bronchiolar, bronchogenic, squamous cell, and transitional cell), histiocytic disorders, leukemia (e.g., B cell, mixed cell, null cell, T cell, T-cell chronic, HTLV-II-associated, lymphocytic acute, lymphocytic chronic, mast cell, and myeloid), histiocytosis malignant, Hodgkin disease, immunoproliferative small, non-Hodgkin lymphoma, plasmacytoma, reticuloendotheliosis, melanoma, chondroblastoma, chondroma, chondrosarcoma, fibroma, fibrosarcoma, giant cell tumors, histiocytoma, lipoma, liposarcoma, mesothelioma, myxoma, myxosarcoma, osteoma, osteosarcoma, Ewing sarcoma, synovioma, adenofibroma, adenolymphoma, carcinosarcoma, chordoma, craniopharyngioma, dysgerminoma, hamartoma, mesenchymoma, mesonephroma, myosarcoma, ameloblastoma, cementoma, odontoma, teratoma, thymoma, trophoblastic tumor, adenocarcinoma, adenoma, cholangioma, cholesteatoma, cylindroma, cystadenocarcinoma, cystadenoma, granulosa cell tumor, gynandroblastoma, hepatoma, hidradenoma, islet cell tumor, Leydig cell tumor, papilloma, Sertoli cell tumor, theca cell tumor, leiomyoma, leiomyosarcoma, myoblastoma, myosarcoma, rhabdomyoma, rhabdomyosarcoma, ependymoma, ganglioneuroma, glioma, medulloblastoma, meningioma, neurilemmoma, neuroblastoma, neuroepithelioma, neurofibroma, neuroma, paraganglioma, paraganglioma nonchromaffin, angiokeratoma, angiolymphoid hyperplasia with eosinophilia, angioma sclerosing, angiomatosis, glomangioma, hemangioendothelioma, hemangioma, hemangiopericytoma, hemangiosarcoma, lymphangioma, lymphangiomyoma, lymphangiosarcoma, pinealoma, carcinosarcoma, chondrosarcoma, cystosarcoma phyllodes, fibrosarcoma, hemangiosarcoma, leiomyosarcoma, leukosarcoma, liposarcoma, lymphangiosarcoma, myosarcoma, myxosarcoma, ovarian carcinoma, rhabdomyosarcoma, sarcoma (e.g., Ewing, experimental, Kaposi, and mast cell), neurofibromatosis, and cervical dysplasia, and other conditions in which cells have become immortalized or transformed.

Classification

The term “classification” refers to a procedure and/or algorithm in which individual items are placed into groups or classes based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, features, etc.) and based on a statistical model and/or a training set of previously labeled items. A “classification tree” is a decision tree that places categorical variables into classes.

Complement

“Complement” or “complementary”, as used herein to refer to a nucleic acid, may mean Watson-Crick (e.g., A-T/U and C-G) or Hoogsteen base pairing between nucleotides or nucleotide analogs of nucleic acid molecules. A full complement or fully complementary means 100% complementary base pairing between nucleotides or nucleotide analogs of nucleic acid molecules. In some embodiments, the complementary sequence has a reverse orientation (5′-3′).

Ct

Ct signals represent the first cycle of PCR where amplification crosses a threshold (cycle threshold) of fluorescence. Accordingly, low values of Ct represent high abundance or expression levels of the microRNA.

In some embodiments the PCR Ct signal is normalized such that the normalized Ct remains inversely related to the expression level. In other embodiments the PCR Ct signal may be normalized and then inverted such that low normalized-inverted Ct represents low abundance or expression levels of the microRNA.
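The inverse relation between Ct and abundance may be illustrated by a short sketch. It assumes ideal amplification (the product doubles every cycle), so that a difference of one cycle corresponds to a two-fold difference in abundance; the function names, the reference Ct and the ceiling value of 50 used for inversion are hypothetical and are not specified in the text.

```python
def relative_abundance(ct: float, reference_ct: float) -> float:
    """Fold abundance relative to a reference, assuming perfect doubling
    per PCR cycle. A lower Ct gives a higher relative abundance."""
    return 2.0 ** (reference_ct - ct)


def invert_ct(normalized_ct: float, ceiling: float = 50.0) -> float:
    """Invert a normalized Ct so that low values mean low abundance.
    The ceiling is an arbitrary illustrative constant."""
    return ceiling - normalized_ct
```

For instance, under these assumptions a microRNA crossing the threshold three cycles earlier than the reference would be taken as eight-fold more abundant.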

Data Processing Routine

As used herein, a “data processing routine” refers to a process that can be embodied in software that determines the biological significance of acquired data (i.e., the ultimate results of an assay or analysis). For example, the data processing routine can make a determination of tissue of origin based upon the data collected. In the systems and methods herein, the data processing routine can also control the data collection routine based upon the results determined. The data processing routine and the data collection routines can be integrated and provide feedback to operate the data acquisition, and hence provide assay-based judging methods.

Data Set

As used herein, the term “data set” refers to numerical values obtained from the analysis. These numerical values associated with analysis may be values such as peak height and area under the curve.

Data Structure

As used herein, the term “data structure” refers to a combination of two or more data sets, applying one or more mathematical manipulations to one or more data sets to obtain one or more new data sets, or manipulating two or more data sets into a form that provides a visual illustration of the data in a new way. An example of a data structure prepared from manipulation of two or more data sets would be a hierarchical cluster.

Detection

“Detection” means detecting the presence of a component in a sample. Detection also means detecting the absence of a component. Detection also means determining the level of a component, either quantitatively or qualitatively.

Differential Expression

“Differential expression” means qualitative or quantitative differences in the temporal and/or spatial gene expression patterns within and among cells and tissue. Thus, a differentially expressed gene may qualitatively have its expression altered, including an activation or inactivation, in, e.g., normal versus diseased tissue. Genes may be turned on or turned off in a particular state, relative to another state, thus permitting comparison of two or more states. A qualitatively regulated gene may exhibit an expression pattern within a state or cell type which may be detectable by standard techniques. Some genes may be expressed in one state or cell type, but not in both. Alternatively, the difference in expression may be quantitative, e.g., in that expression is modulated, up-regulated, resulting in an increased amount of transcript, or down-regulated, resulting in a decreased amount of transcript. The degree to which expression differs needs only to be large enough to quantify via standard characterization techniques such as expression arrays, quantitative reverse transcriptase PCR, northern blot analysis, real-time PCR, in situ hybridization and RNase protection.

Expression Profile

The term “expression profile” is used broadly to include a genomic expression profile, e.g., an expression profile of microRNAs. Profiles may be generated by any convenient means for determining a level of a nucleic acid sequence, e.g., quantitative hybridization of microRNA, labeled microRNA, amplified microRNA, cDNA, etc., quantitative PCR, ELISA for quantitation, and the like, and allow the analysis of differential gene expression between two samples. A subject or patient tumor sample, e.g., cells or collections thereof, e.g., tissues, is assayed. Samples are collected by any convenient method, as known in the art. Nucleic acid sequences of interest are nucleic acid sequences that are found to be predictive, including the nucleic acid sequences provided above, where the expression profile may include expression data for 5, 10, 20, 25, 50, 100 or more of the nucleic acid sequences, including all of the listed nucleic acid sequences. According to some embodiments, the term “expression profile” means measuring the relative abundance of the nucleic acid sequences in the measured samples.

Expression Ratio

“Expression ratio”, as used herein, refers to relative expression levels of two or more nucleic acids as determined by detecting the relative expression levels of the corresponding nucleic acids in a biological sample.

FDR

When performing multiple statistical tests, for example in comparing the signal between two groups in multiple data features, there is an increasingly high probability of obtaining false positive results, by random differences between the groups that can reach levels that would otherwise be considered statistically significant. In order to limit the proportion of such false discoveries, statistical significance is defined only for data features in which the differences reached a p-value (by two-sided t-test) below a threshold, which is dependent on the number of tests performed and the distribution of p-values obtained in these tests.
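One widely used procedure whose threshold depends on both the number of tests and the distribution of the obtained p-values is the Benjamini-Hochberg step-up procedure. The text does not name the procedure actually used, so the following is only an illustrative sketch under that assumption; the function name and the default FDR level of 0.1 are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], fdr: float = 0.1) -> List[bool]:
    """Return, for each p-value (in input order), whether it is declared
    significant at the given false discovery rate."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank r such that p_(r) <= fdr * r / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    # Every feature ranked at or below the cutoff is significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant
```

The effective p-value threshold is thus the largest sorted p-value that still falls under the line fdr * rank / m, which is why it varies with the number of tests and with the observed p-values.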

Fragment

“Fragment” is used herein to indicate a non-full-length part of a nucleic acid. Thus, a fragment is itself also a nucleic acid.

Gene

“Gene”, as used herein, may be a natural (e.g., genomic) or synthetic gene comprising transcriptional and/or translational regulatory sequences and/or a coding region and/or non-translated sequences (e.g., introns, 5′- and 3′-untranslated sequences). The coding region of a gene may be a nucleotide sequence coding for an amino acid sequence or a functional RNA, such as tRNA, rRNA, catalytic RNA, siRNA, miRNA or antisense RNA. A gene may also be an mRNA or cDNA corresponding to the coding regions (e.g., exons and miRNA) optionally comprising 5′- or 3′-untranslated sequences linked thereto. A gene may also be an amplified nucleic acid molecule produced in vitro, comprising all or a part of the coding region and/or 5′- or 3′-untranslated sequences linked thereto.

Groove Binder/Minor Groove Binder (MGB)

“Groove binder” and/or “minor groove binder” may be used interchangeably and refer to small molecules that fit into the minor groove of double-stranded DNA, typically in a sequence-specific manner. Minor groove binders may be long, flat molecules that can adopt a crescent-like shape and thus fit snugly into the minor groove of a double helix, often displacing water. Minor groove binding molecules may typically comprise several aromatic rings connected by bonds with torsional freedom such as furan, benzene, or pyrrole rings. Minor groove binders may be antibiotics such as netropsin, distamycin, berenil, pentamidine and other aromatic diamidines, Hoechst 33258, SN 6999, aureolic anti-tumor drugs such as chromomycin and mithramycin, CC-1065, dihydrocyclopyrroloindole tripeptide (DPI₃), 1,2-dihydro-(3H)-pyrrolo[3,2-e]indole-7-carboxylate (CDPI₃), and related compounds and analogues, including those described in Nucleic Acids in Chemistry and Biology, 2nd ed., Blackburn and Gait, eds., Oxford University Press, 1996, and PCT Published Application No. WO 03/078450, the contents of which are incorporated herein by reference. A minor groove binder may be a component of a primer, a probe, a hybridization tag complement, or combinations thereof. Minor groove binders may increase the T_(m) of the primer or probe to which they are attached, allowing such primers or probes to effectively hybridize at higher temperatures.

Host Cell

“Host cell”, as used herein, may be a naturally occurring cell or a transformed cell that may contain a vector and may support replication of the vector. Host cells may be cultured cells, explants, cells in vivo, and the like. Host cells may be prokaryotic cells, such as E. coli, or eukaryotic cells, such as yeast, insect, amphibian, or mammalian cells, such as CHO and HeLa cells.

Identity

“Identical” or “identity”, as used herein, in the context of two or more nucleic acids or polypeptide sequences means that the sequences have a specified percentage of residues that are the same over a specified region. The percentage may be calculated by optimally aligning the two sequences, comparing the two sequences over the specified region, determining the number of positions at which the identical residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the specified region, and multiplying the result by 100 to yield the percentage of sequence identity. In cases where the two sequences are of different lengths or the alignment produces one or more staggered ends and the specified region of comparison includes only a single sequence, the residues of the single sequence are included in the denominator but not the numerator of the calculation. When comparing DNA and RNA sequences, thymine (T) and uracil (U) may be considered equivalent. The determination of identity may be performed manually or by using a computer sequence algorithm such as BLAST or BLAST 2.0.
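The percentage calculation described above may be sketched as follows for two sequences that have already been aligned. The sketch assumes that the aligned sequences are given as strings of equal length, with “-” marking positions where only one sequence is present; the alignment step itself (e.g., by BLAST) is not shown, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of positions in the compared region at which both aligned
    sequences carry the same residue. Gap positions ('-') count in the
    denominator but never in the numerator; T and U are treated as equal."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("the compared region must not be empty")
    a = aligned_a.upper().replace("U", "T")
    b = aligned_b.upper().replace("U", "T")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)
```

For example, an eight-position region in which one sequence overhangs the other by two positions and the remaining six positions match would yield 75% identity under this convention.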

In Situ Detection

“In situ detection”, as used herein, means the detection of expression or expression levels in the original site, hereby meaning in a tissue sample such as a biopsy.

k-Nearest Neighbor

The phrase “k-nearest neighbor” refers to a classification method that classifies a point by calculating the distances between the point and points in the training data set. It then assigns the point to the class that is most common among its k-nearest neighbors (where k is an integer).

Label

“Label”, as used herein, means a composition detectable by spectroscopic, photochemical, biochemical, immunochemical, chemical, or other physical means. For example, useful labels include ³²P, fluorescent dyes, electron-dense reagents, enzymes (e.g., as commonly used in an ELISA), biotin, digoxigenin, or haptens and other entities which can be made detectable. A label may be incorporated into nucleic acids and proteins at any position.

Logistic Regression

Logistic regression is part of a category of statistical models called generalized linear models. Logistic regression can allow one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. The dependent or response variable can be dichotomous, for example, one of two possible types of cancer. Logistic regression models the natural log of the odds, i.e., the ratio of the probability of belonging to the first group (P) over the probability of belonging to the second group (1−P), as a linear combination of the different expression levels (in log-space). The logistic regression output can be used as a classifier by prescribing that a case or sample will be classified into the first type if P is greater than 0.5 or 50%. Alternatively, the calculated probability P can be used as a variable in other contexts, such as a 1D or 2D threshold classifier.
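The relation described above, ln(P/(1−P)) = b0 + b1*x1 + b2*x2 + . . ., can be inverted to obtain P from the expression levels. The following is a minimal sketch; the function names, the coefficient values and the group labels are assumptions made for the example, and fitting the coefficients from training data is not shown.

```python
import math
from typing import Sequence


def logistic_probability(b0: float,
                         coefs: Sequence[float],
                         levels: Sequence[float]) -> float:
    """P of belonging to the first group, from the log-odds
    z = b0 + sum(b_i * x_i), via P = 1 / (1 + exp(-z))."""
    z = b0 + sum(b * x for b, x in zip(coefs, levels))
    return 1.0 / (1.0 + math.exp(-z))


def assign_group(p: float, cutoff: float = 0.5) -> str:
    """Classify into the first group if P exceeds the cutoff (0.5 in the text)."""
    return "first group" if p > cutoff else "second group"
```

With a hypothetical intercept of −1.0, a single coefficient of 2.0 and an expression level of 1.5, the log-odds equal 2.0 and P is approximately 0.88, so the sample would be assigned to the first group.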

1D/2D Threshold Classifier

“1D/2D threshold classifier”, as used herein, may mean an algorithm for classifying a case or sample such as a cancer sample into one of two possible types such as two types of cancer. For a 1D threshold classifier, the decision is based on one variable and one predetermined threshold value; the sample is assigned to one class if the variable exceeds the threshold and to the other class if the variable is less than the threshold. A 2D threshold classifier is an algorithm for classifying into one of two types based on the values of two variables. A threshold may be calculated as a function (usually a continuous or even a monotonic function) of the first variable; the decision is then reached by comparing the second variable to the calculated threshold, similar to the 1D threshold classifier.
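The two classifiers defined above may be sketched as follows. The function names, the class labels and the example threshold function (a straight line) are assumptions made for illustration; a value exactly equal to the threshold is assigned to the second class here, a case the definition leaves open.

```python
from typing import Callable


def classify_1d(value: float, threshold: float) -> str:
    """One variable compared to one predetermined threshold."""
    return "class 1" if value > threshold else "class 2"


def classify_2d(first: float, second: float,
                threshold_fn: Callable[[float], float]) -> str:
    """The threshold is computed as a function of the first variable,
    then the second variable is compared to it as in the 1D case."""
    return classify_1d(second, threshold_fn(first))


def example_threshold(first: float) -> float:
    """Hypothetical monotonic threshold function: a line with made-up slope."""
    return 0.5 * first + 2.0
```

Expressing the 2D case through the 1D function mirrors the definition, in which the second comparison is described as similar to the 1D threshold classifier.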

Metastasis

“Metastasis” means the process by which cancer spreads from the place at which it first arose as a primary tumor to other locations in the body. The metastatic progression of a primary tumor reflects multiple stages, including dissociation from neighboring primary tumor cells, survival in the circulation, and growth in a secondary location.

Node

A “node” is a decision point in a classification (i.e., decision) tree. A node may also be a point in a neural net that combines input from other nodes and produces an output through application of an activation function. A “leaf” is a node not further split, the terminal grouping in a classification or decision tree.

Nucleic Acid

“Nucleic acid” or “oligonucleotide” or “polynucleotide”, as used herein, mean at least two nucleotides covalently linked together. The depiction of a single strand also defines the sequence of the complementary strand. Thus, a nucleic acid also encompasses the complementary strand of a depicted single strand. Many variants of a nucleic acid may be used for the same purpose as a given nucleic acid. Thus, a nucleic acid also encompasses substantially identical nucleic acids and complements thereof. A single strand provides a probe that may hybridize to a target sequence under stringent hybridization conditions. Thus, a nucleic acid also encompasses a probe that hybridizes under stringent hybridization conditions.

Nucleic acids may be single-stranded or double-stranded, or may contain portions of both double-stranded and single-stranded sequences. The nucleic acid may be DNA, both genomic and cDNA, RNA, or a hybrid, where the nucleic acid may contain combinations of deoxyribo- and ribo-nucleotides, and combinations of bases including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine and isoguanine. Nucleic acids may be obtained by chemical synthesis methods or by recombinant methods.

A nucleic acid will generally contain phosphodiester bonds, although nucleic acid analogs may be included that may have at least one different linkage, e.g., phosphoramidate, phosphorothioate, phosphorodithioate, or O-methylphosphoroamidite linkages and peptide nucleic acid backbones and linkages. Other analog nucleic acids include those with positive backbones, non-ionic backbones and non-ribose backbones, including those described in U.S. Pat. Nos. 5,235,033 and 5,034,506, which are incorporated herein by reference. Nucleic acids containing one or more non-naturally occurring or modified nucleotides are also included within one definition of nucleic acids. The modified nucleotide analog may be located for example at the 5′-end and/or the 3′-end of the nucleic acid molecule. Representative examples of nucleotide analogs may be selected from sugar- or backbone-modified ribonucleotides. It should be noted, however, that nucleobase-modified ribonucleotides, i.e., ribonucleotides containing a non-naturally occurring nucleobase instead of a naturally occurring nucleobase, such as uridine or cytidine modified at the 5-position, e.g., 5-(2-amino) propyl uridine, 5-bromo uridine; adenosine and guanosine modified at the 8-position, e.g., 8-bromo guanosine; deaza nucleotides, e.g., 7-deaza-adenosine; O- and N-alkylated nucleotides, e.g., N6-methyl adenosine, are also suitable. The 2′-OH-group may be replaced by a group selected from H, OR, R, halo, SH, SR, NH₂, NHR, NR₂ or CN, wherein R is C1-C6 alkyl, alkenyl or alkynyl and halo is F, Cl, Br or I. Modified nucleotides also include nucleotides conjugated with cholesterol through, e.g., a hydroxyprolinol linkage as described in Krutzfeldt et al., Nature 2005; 438:685-689, Soutschek et al., Nature 2004; 432:173-178, and U.S. Patent Publication No. 20050107325, which are incorporated herein by reference. Additional modified nucleotides and nucleic acids are described in U.S. Patent Publication No. 20050182005, which is incorporated herein by reference.
Modifications of the ribose-phosphate backbone may be done for a variety of reasons, e.g., to increase the stability and half-life of such molecules in physiological environments, to enhance diffusion across cell membranes, or as probes on a biochip. The backbone modification may also enhance resistance to degradation, such as in the harsh endocytic environment of cells. The backbone modification may also reduce nucleic acid clearance by hepatocytes, such as in the liver and kidney. Mixtures of naturally occurring nucleic acids and analogs may be made; alternatively, mixtures of different nucleic acid analogs may be made.

Probe

“Probe”, as used herein, means an oligonucleotide capable of binding to a target nucleic acid of complementary sequence through one or more types of chemical bonds, usually through complementary base pairing, usually through hydrogen bond formation. Probes may bind target sequences lacking complete complementarity with the probe sequence depending upon the stringency of the hybridization conditions. There may be any number of base pair mismatches which will interfere with hybridization between the target sequence and the single-stranded nucleic acids described herein. However, if the number of mutations is so great that no hybridization can occur under even the least stringent of hybridization conditions, the sequence is not a complementary target sequence. A probe may be single-stranded or partially single- and partially double-stranded. The strandedness of the probe is dictated by the structure, composition, and properties of the target sequence. Probes may be directly labeled or indirectly labeled such as with biotin to which a streptavidin complex may later bind.

Reference Value

As used herein, the term “reference value” or “reference expression profile” refers to a criterion expression value to which measured values are compared in order to determine the detection of a specific cancer. The reference value may be based on the abundance of the nucleic acids, or may be based on a combined metric score thereof.

In preferred embodiments the reference value is determined from statistical analysis of studies that compare microRNA expression with known clinical outcomes.

Sensitivity

“Sensitivity”, as used herein, may mean a statistical measure of how well a binary classification test correctly identifies a condition, for example, how frequently it correctly classifies a cancer into the correct type out of two possible types. The sensitivity for class A is the proportion of cases that are determined to belong to class “A” by the test out of the cases that are in class “A”, as determined by some absolute or gold standard.

Specificity

“Specificity”, as used herein, may mean a statistical measure of how well a binary classification test correctly identifies a condition, for example, how frequently it correctly classifies a cancer into the correct type out of two possible types. The specificity for class A is the proportion of cases that are determined to belong to class “not A” by the test out of the cases that are in class “not A”, as determined by some absolute or gold standard.
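The two proportions defined above may be computed from paired lists of gold-standard and test-assigned classes, as in the following minimal sketch. The function name and the toy labels are assumptions made for the example; the sketch also assumes that at least one case of class “A” and one case of class “not A” are present, since otherwise a proportion is undefined.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(true_labels: Sequence[str],
                            predicted_labels: Sequence[str],
                            target: str) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for the class `target`.
    Sensitivity: cases called target among cases that truly are target.
    Specificity: cases called not-target among cases that truly are not."""
    tp = fn = tn = fp = 0
    for truth, call in zip(true_labels, predicted_labels):
        if truth == target:
            if call == target:
                tp += 1
            else:
                fn += 1
        else:
            if call == target:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: four true "A" cases and five true "B" cases.
toy_truth = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
toy_calls = ["A", "A", "A", "B", "B", "B", "B", "B", "A"]
```

In the toy data, three of the four “A” cases are called “A” (sensitivity 0.75) and four of the five “not A” cases are called “not A” (specificity 0.8).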

Stringent Hybridization Conditions

“Stringent hybridization conditions”, as used herein, mean conditions under which a first nucleic acid sequence (e.g., probe) will hybridize to a second nucleic acid sequence (e.g., target), such as in a complex mixture of nucleic acids. Stringent conditions are sequence-dependent and will be different in different circumstances. Stringent conditions may be selected to be about 5-10° C. lower than the thermal melting point (T_(m)) for the specific sequence at a defined ionic strength and pH. The T_(m) may be the temperature (under defined ionic strength, pH, and nucleic acid concentration) at which 50% of the probes complementary to the target hybridize to the target sequence at equilibrium (as the target sequences are present in excess, at T_(m), 50% of the probes are occupied at equilibrium). Stringent conditions may be those in which the salt concentration is less than about 1.0 M sodium ion, such as about 0.01-1.0 M sodium ion concentration (or other salts) at pH 7.0 to 8.3 and the temperature is at least about 30° C. for short probes (e.g., about 10-50 nucleotides) and at least about 60° C. for long probes (e.g., greater than about 50 nucleotides). Stringent conditions may also be achieved with the addition of destabilizing agents such as formamide. For selective or specific hybridization, a positive signal may be at least 2 to 10 times background hybridization. Exemplary stringent hybridization conditions include the following: 50% formamide, 5×SSC, and 1% SDS, incubating at 42° C., or, 5×SSC, 1% SDS, incubating at 65° C., with wash in 0.2×SSC, and 0.1% SDS at 65° C.

Substantially Complementary

“Substantially complementary”, as used herein, means that a first sequence is at least 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98% or 99% identical to the complement of a second sequence over a region of 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100 or more nucleotides, or that the two sequences hybridize under stringent hybridization conditions.

Substantially Identical

“Substantially identical”, as used herein, means that a first and a second sequence are at least 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98% or 99% identical over a region of 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100 or more nucleotides or amino acids, or with respect to nucleic acids, if the first sequence is substantially complementary to the complement of the second sequence.
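The percent-identity criteria in the two preceding definitions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: identity is computed over ungapped windows of a fixed length, “complement” is taken to mean the reverse complement of an RNA sequence, and the hybridization-based alternative in the definitions is not modeled. The function names and the toy sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def percent_identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def substantially_identical(a, b, min_identity=0.60, region=8):
    """True if any ungapped window of `region` nucleotides in `a` is at
    least `min_identity` identical to some window of the same size in `b`."""
    for i in range(len(a) - region + 1):
        for j in range(len(b) - region + 1):
            if percent_identity(a[i:i + region], b[j:j + region]) >= min_identity:
                return True
    return False


def substantially_complementary(a, b, min_identity=0.60, region=8):
    """Same test, applied to `a` and the reverse complement of `b`."""
    return substantially_identical(a, reverse_complement(b), min_identity, region)
```

The defaults (60% over 8 nucleotides) are simply the lowest values listed in the definitions; any of the other listed percentages or region lengths can be passed instead.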

Subject

As used herein, the term “subject” refers to a mammal, including both human and other mammals. The methods of the present invention are preferably applied to human subjects.

Target Nucleic Acid

“Target nucleic acid”, as used herein, means a nucleic acid or variant thereof that may be bound by another nucleic acid. A target nucleic acid may be a DNA sequence. The target nucleic acid may be RNA. The target nucleic acid may comprise a mRNA, tRNA, shRNA, siRNA or Piwi-interacting RNA, or a pri-miRNA, pre-miRNA, miRNA, or anti-miRNA.

The target nucleic acid may comprise a target miRNA binding site or a variant thereof. One or more probes may bind the target nucleic acid. The target binding site may comprise 5-100 or 10-60 nucleotides. The target binding site may comprise a total of 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30-40, 40-50, 50-60, 61, 62 or 63 nucleotides. The target site sequence may comprise at least 5 nucleotides of the sequence of a target miRNA binding site disclosed in U.S. patent application Ser. No. 11/384,049, 11/418,870 or 11/429,720, the contents of which are incorporated herein.

Threshold

As used herein, the term “threshold” means the numerical value assigned for each run, which reflects a statistically significant point above the calculated PCR baseline.

Tissue Sample

As used herein, a tissue sample is tissue obtained from a tissue biopsy using methods well known to those of ordinary skill in the related medical arts. The phrase “suspected of being cancerous”, as used herein, means a cancer tissue sample believed by one of ordinary skill in the medical arts to contain cancerous cells. Methods for obtaining the sample from the biopsy include gross apportioning of a mass, microdissection, laser-based microdissection, or other art-known cell-separation methods.

Tumor

“Tumor”, as used herein, refers to all neoplastic cell growth and proliferation, whether malignant or benign, and all pre-cancerous and cancerous cells and tissues.

Variant

“Variant”, as used herein, referring to a nucleic acid means (i) a portion of a referenced nucleotide sequence; (ii) the complement of a referenced nucleotide sequence or portion thereof; (iii) a nucleic acid that is substantially identical to a referenced nucleic acid or the complement thereof; or (iv) a nucleic acid that hybridizes under stringent conditions to the referenced nucleic acid, complement thereof, or a sequence substantially identical thereto.

Wild Type

As used herein, the term “wild-type” sequence refers to a coding, a non-coding or an interface sequence which is an allelic form of sequence that performs the natural or normal function for that sequence. Wild-type sequences include multiple allelic forms of a cognate sequence, for example, multiple alleles of a wild type sequence may encode silent or conservative changes to the protein sequence that a coding sequence encodes.

The present invention employs miRNAs for the identification, classification and diagnosis of specific cancers and the identification of their tissues of origin.

1. microRNA Processing

A gene coding for microRNA (miRNA) may be transcribed leading to production of a miRNA primary transcript known as the pri-miRNA. The pri-miRNA may comprise a hairpin with a stem and loop structure. The stem of the hairpin may comprise mismatched bases. The pri-miRNA may comprise several hairpins in a polycistronic structure.

The hairpin structure of the pri-miRNA may be recognized by Drosha, which is an RNase III endonuclease. Drosha may recognize terminal loops in the pri-miRNA and cleave approximately two helical turns into the stem to produce a 60-70 nt precursor known as the pre-miRNA. Drosha may cleave the pri-miRNA with a staggered cut typical of RNase III endonucleases yielding a pre-miRNA stem loop with a 5′ phosphate and ~2 nucleotide 3′ overhang. Approximately one helical turn of stem (~10 nucleotides) extending beyond the Drosha cleavage site may be essential for efficient processing. The pre-miRNA may then be actively transported from the nucleus to the cytoplasm by Ran-GTP and the export receptor Exportin-5.

The pre-miRNA may be recognized by Dicer, which is also an RNase III endonuclease. Dicer may recognize the double-stranded stem of the pre-miRNA. Dicer may also cut off the terminal loop two helical turns away from the base of the stem loop, leaving an additional 5′ phosphate and a ~2 nucleotide 3′ overhang. The resulting siRNA-like duplex, which may comprise mismatches, comprises the mature miRNA and a similar-sized fragment known as the miRNA*. The miRNA and miRNA* may be derived from opposing arms of the pri-miRNA and pre-miRNA. MiRNA* sequences may be found in libraries of cloned miRNAs, but typically at lower frequency than the miRNAs.

Although initially present as a double-stranded species with miRNA*, the miRNA may eventually become incorporated as a single-stranded RNA into a ribonucleoprotein complex known as the RNA-induced silencing complex (RISC). Various proteins can form the RISC, which can lead to variability in specificity for miRNA/miRNA* duplexes, binding site of the target gene, activity of miRNA (repress or activate), and which strand of the miRNA/miRNA* duplex is loaded into the RISC.

When the miRNA strand of the miRNA:miRNA* duplex is loaded into the RISC, the miRNA* may be removed and degraded. The strand of the miRNA:miRNA* duplex that is loaded into the RISC may be the strand whose 5′ end is less tightly paired. In cases where both ends of the miRNA:miRNA* have roughly equivalent 5′ pairing, both miRNA and miRNA* may have gene silencing activity.

The RISC may identify target nucleic acids based on high levels of complementarity between the miRNA and the mRNA, especially by nucleotides 2-7 of the miRNA. Only one case has been reported in animals where the interaction between the miRNA and its target was along the entire length of the miRNA. This was shown for miR-196 and Hox B8 and it was further shown that miR-196 mediates the cleavage of the Hox B8 mRNA (Yekta et al. Science 2004; 304:594-596). Otherwise, such interactions are known only in plants (Bartel & Bartel 2003; 132:709-717).

A number of studies have looked at the base-pairing requirement between miRNA and its mRNA target for achieving efficient inhibition of translation (reviewed by Bartel 2004; 116:281-297). In mammalian cells, the first 8 nucleotides of the miRNA may be important (Doench & Sharp, Genes Dev 2004; 18:504-511). However, other parts of the microRNA may also participate in mRNA binding. Moreover, sufficient base pairing at the 3′ end can compensate for insufficient pairing at the 5′ end (Brennecke et al., PLoS Biol 2005; 3:e85). Computational studies analyzing miRNA binding on whole genomes have suggested a specific role for bases 2-7 at the 5′ end of the miRNA in target binding, but the role of the first nucleotide, usually found to be “A”, was also recognized (Lewis et al. Cell 2005; 120:15-20). Similarly, nucleotides 1-7 or 2-8 were used to identify and validate targets by Krek et al. (Nat Genet 2005; 37:495-500).
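The role of nucleotides 2-7 in target recognition can be illustrated with a simple seed search. The sketch below is a minimal Python example, assuming perfect Watson-Crick pairing between the seed and the target site and ignoring everything else discussed above (3′ compensatory pairing, the identity of the first nucleotide, site context). The function names and both sequences are hypothetical toy values, not sequences from this disclosure.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def seed(mirna, start=2, end=7):
    """Nucleotides `start`..`end` of the miRNA (1-based, inclusive)."""
    return mirna[start - 1:end]


def find_seed_sites(mirna, mrna, start=2, end=7):
    """0-based positions in `mrna` that pair perfectly with the miRNA seed."""
    site = reverse_complement(seed(mirna, start, end))
    return [i for i in range(len(mrna) - len(site) + 1)
            if mrna[i:i + len(site)] == site]


# Hypothetical toy sequences, for illustration only.
toy_mirna = "AUGGCACUUCGAUCCGUAGGCA"
toy_mrna = "CCAAGUGCCAUUUGG"
sites = find_seed_sites(toy_mirna, toy_mrna)
```

Passing `start=1, end=7` or `start=2, end=8` gives the alternative seed windows mentioned in the text.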

The target sites in the mRNA may be in the 5′ UTR, the 3′ UTR or in the coding region. Interestingly, multiple miRNAs may regulate the same mRNA target by recognizing the same or multiple sites. The presence of multiple miRNA binding sites in most genetically identified targets may indicate that the cooperative action of multiple RISCs provides the most efficient translational inhibition.

miRNAs may direct the RISC to down-regulate gene expression by either of two mechanisms: mRNA cleavage or translational repression. The miRNA may specify cleavage of the mRNA if the mRNA has a certain degree of complementarity to the miRNA. When a miRNA guides cleavage, the cut may be between the nucleotides pairing to residues 10 and 11 of the miRNA. Alternatively, the miRNA may repress translation if the mRNA does not have the requisite degree of complementarity to the miRNA. Translational repression may be more prevalent in animals since animals may have a lower degree of complementarity between the miRNA and binding site.

It should be noted that there may be variability in the 5′ and 3′ ends of any pair of miRNA and miRNA*. This variability may be due to variability in the enzymatic processing of Drosha and Dicer with respect to the site of cleavage. Variability at the 5′ and 3′ ends of miRNA and miRNA* may also be due to mismatches in the stem structures of the pri-miRNA and pre-miRNA. The mismatches of the stem strands may lead to a population of different hairpin structures. Variability in the stem structures may also lead to variability in the products of cleavage by Drosha and Dicer.

2. Nucleic Acids

Nucleic acids are provided herein. The nucleic acids comprise the sequences of SEQ ID NOS: 1-288 or variants thereof. The variant may be a complement of the referenced nucleotide sequence. The variant may also be a nucleotide sequence that is substantially identical to the referenced nucleotide sequence or the complement thereof. The variant may also be a nucleotide sequence which hybridizes under stringent conditions to the referenced nucleotide sequence, complements thereof, or nucleotide sequences substantially identical thereto.

The nucleic acid may have a length of from about 10 to about 250 nucleotides. The nucleic acid may have a length of at least 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200 or 250 nucleotides. The nucleic acid may be synthesized or expressed in a cell (in vitro or in vivo) using a synthetic gene described herein. The nucleic acid may be synthesized as a single-strand molecule and hybridized to a substantially complementary nucleic acid to form a duplex. The nucleic acid may be introduced to a cell, tissue or organ in a single- or double-stranded form or capable of being expressed by a synthetic gene using methods well known to those skilled in the art, including as described in U.S. Pat. No. 6,506,559, which is incorporated herein by reference.

TABLE 1
SEQ ID NOS of miRs, forward primers and MGB probes

Sanger miR name     miR SEQ ID NO:    FW primer SEQ ID NO:    MGB probe SEQ ID NO:
hsa-let-7b          1                 50                      99
hsa-let-7f          2                 51                      100
hsa-miR-106a        3                 52                      101
hsa-miR-10a         4                 53                      102
hsa-miR-10b         5                 54                      103
hsa-miR-122         6                 55                      104
hsa-miR-124         7                 56                      105
hsa-miR-125b        8                 57                      106
hsa-miR-126         9                 58                      107
hsa-miR-128         10                59                      108
hsa-miR-130a        11                60                      109
hsa-miR-138         12                61                      110
hsa-miR-142-3p      13                62                      111
hsa-miR-143         14                63                      112
hsa-miR-146a        15                64                      113
hsa-miR-146b-5p     16                65                      114
hsa-miR-148b        17                66                      115
hsa-miR-152         18                67                      116
hsa-miR-15b         19                68                      117
hsa-miR-185         20                69                      118
hsa-miR-192         21                70                      119, 120
hsa-miR-193a-3p     22                71                      121
hsa-miR-19b         23                72                      122
hsa-miR-200a        24                73                      123
hsa-miR-200b        25                74                      124
hsa-miR-200c        26                75                      125, 126
hsa-miR-205         27                76                      127
hsa-miR-20a         28                77                      128
hsa-miR-21          29                78                      129
hsa-miR-210         30                79                      130
hsa-miR-221         31                80                      131
hsa-miR-222         32                81                      132
hsa-miR-25          33                82                      133
hsa-miR-29a         34                83                      134
hsa-miR-29b         35                84                      135
hsa-miR-29c         36                85                      136
hsa-miR-30a         37                86                      137
hsa-miR-31          38                87                      138
hsa-miR-342-3p      39                88                      139
hsa-miR-345         40                89                      140
hsa-miR-372         41                90                      141
hsa-miR-375         42                91                      142
hsa-miR-378         43                92                      143
hsa-miR-425         44                93                      144
hsa-miR-451         45                94                      145
hsa-miR-497         46                95                      146
hsa-miR-9*          47                96                      147
hsa-mir-92b         48                97                      148
hsa-miR-509-3p      49                150                     151
U6                  -                 98                      149

Sanger miR name: the miRBase registry name (release 9-12)

3. Nucleic Acid Complexes

The nucleic acid may further comprise one or more of the following: a peptide, a protein, a RNA-DNA hybrid, an antibody, an antibody fragment, a Fab fragment, and an aptamer.

4. Pri-miRNA

The nucleic acid may comprise a sequence of a pri-miRNA or a variant thereof. The pri-miRNA sequence may comprise from 45-30,000, 50-25,000, 100-20,000, 1,000-1,500 or 80-100 nucleotides. The sequence of the pri-miRNA may comprise a pre-miRNA, miRNA and miRNA*, as set forth herein, and variants thereof. The sequence of the pri-miRNA may comprise any of the sequences of SEQ ID NOS: 1-49 or variants thereof.

The pri-miRNA may comprise a hairpin structure. The hairpin may comprise a first and a second nucleic acid sequence that are substantially complementary. The first and second nucleic acid sequence may be from 37-50 nucleotides. The first and second nucleic acid sequence may be separated by a third sequence of from 8-12 nucleotides. The hairpin structure may have a free energy of less than −25 kcal/mole, as calculated by the Vienna algorithm with default parameters, as described in Hofacker et al. (Monatshefte f. Chemie 1994; 125:167-188), the contents of which are incorporated herein by reference. The hairpin may comprise a terminal loop of 4-20, 8-12 or 10 nucleotides. The pri-miRNA may comprise at least 19% adenosine nucleotides, at least 16% cytosine nucleotides, at least 23% thymine nucleotides and at least 19% guanine nucleotides.
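The nucleotide-composition thresholds stated above lend themselves to a direct check. The following is a minimal Python sketch, assuming the sequence is supplied as a plain string and that uracil in an RNA sequence is counted toward the thymine threshold (an assumption made here for convenience). The free-energy criterion is not evaluated, since that would require an RNA folding package; the function names are hypothetical.

```python
from collections import Counter

# Minimum base fractions quoted in the text for a pri-miRNA.
MIN_FRACTIONS = {"A": 0.19, "C": 0.16, "T": 0.23, "G": 0.19}


def base_fractions(sequence):
    """Fraction of A, C, G and T in the sequence (U is counted as T)."""
    seq = sequence.upper().replace("U", "T")
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "ACGT"}


def meets_composition(sequence):
    """True if every base reaches its minimum fraction."""
    fractions = base_fractions(sequence)
    return all(fractions[base] >= minimum
               for base, minimum in MIN_FRACTIONS.items())
```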

5. Pre-miRNA

The nucleic acid may also comprise a sequence of a pre-miRNA or a variant thereof. The pre-miRNA sequence may comprise from 45-90, 60-80 or 60-70 nucleotides. The sequence of the pre-miRNA may comprise a miRNA and a miRNA* as set forth herein. The sequence of the pre-miRNA may also be that of a pri-miRNA excluding from 0-160 nucleotides from the 5′ and 3′ ends of the pri-miRNA. The sequence of the pre-miRNA may comprise the sequence of SEQ ID NOS: 1-49 or variants thereof.

6. miRNA

The nucleic acid may also comprise a sequence of a miRNA (including miRNA*) or a variant thereof. The miRNA sequence may comprise from 13-33, 18-24 or 21-23 nucleotides. The miRNA may also comprise a total of at least 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39 or 40 nucleotides. The sequence of the miRNA may be the first 13-33 nucleotides of the pre-miRNA. The sequence of the miRNA may also be the last 13-33 nucleotides of the pre-miRNA. The sequence of the miRNA may comprise the sequence of SEQ ID NOS: 1-49 or variants thereof.

7. Probes

A probe comprising a nucleic acid described herein is also provided. Probes may be used for screening and diagnostic methods, as outlined below. The probe may be attached or immobilized to a solid substrate, such as a biochip.

The probe may have a length of from 8 to 500, 10 to 100 or 20 to 60 nucleotides. The probe may also have a length of at least 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280 or 300 nucleotides. The probe may further comprise a linker sequence of from 10-60 nucleotides. The probe may comprise a nucleic acid that is complementary to a sequence selected from the group consisting of SEQ ID NOS: 1-49 or variants thereof. The probe may comprise a sequence selected from the group consisting of SEQ ID NOS: 99-149 and 151.

8. Biochip

A biochip is also provided. The biochip may comprise a solid substrate comprising an attached probe or plurality of probes described herein. The probes may be capable of hybridizing to a target sequence under stringent hybridization conditions. The probes may be attached at spatially defined addresses on the substrate. More than one probe per target sequence may be used, with either overlapping probes or probes to different sections of a particular target sequence. The probes may be capable of hybridizing to target sequences associated with a single disorder appreciated by those in the art. The probes may either be synthesized first, with subsequent attachment to the biochip, or may be directly synthesized on the biochip.

The solid substrate may be a material that may be modified to contain discrete individual sites appropriate for the attachment or association of the probes and is amenable to at least one detection method. Representative examples of substrates include glass and modified or functionalized glass, plastics (including acrylics, polystyrene and copolymers of styrene and other materials, polypropylene, polyethylene, polybutylene, polyurethanes, Teflon, etc.), polysaccharides, nylon or nitrocellulose, resins, silica or silica-based materials including silicon and modified silicon, carbon, metals, inorganic glasses and plastics. The substrates may allow optical detection without appreciably fluorescing.

The substrate may be planar, although other configurations of substrates may be used as well. For example, probes may be placed on the inside surface of a tube, for flow-through sample analysis to minimize sample volume. Similarly, the substrate may be flexible, such as flexible foam, including closed cell foams made of particular plastics.

The biochip and the probe may be derivatized with chemical functional groups for subsequent attachment of the two. For example, the biochip may be derivatized with a chemical functional group including, but not limited to, amino groups, carboxyl groups, oxo groups or thiol groups. Using these functional groups, the probes may be attached using functional groups on the probes either directly or indirectly using a linker. The probes may be attached to the solid support by either the 5′ terminus, 3′ terminus, or via an internal nucleotide.

The probe may also be attached to the solid support non-covalently. For example, biotinylated oligonucleotides can be made, which may bind to surfaces covalently coated with streptavidin, resulting in attachment. Alternatively, probes may be synthesized on the surface using techniques such as photopolymerization and photolithography.

9. Diagnostics

As used herein, the term “diagnosing” refers to classifying pathology, or a symptom, determining a severity of the pathology (grade or stage), monitoring pathology progression, forecasting an outcome of pathology and/or prospects of recovery.

As used herein, the phrase “subject in need thereof” refers to an animal or human subject who is known to have cancer, at risk of having cancer (e.g., a genetically predisposed subject, a subject with medical and/or family history of cancer, a subject who has been exposed to carcinogens, occupational hazard, environmental hazard) and/or a subject who exhibits suspicious clinical signs of cancer (e.g., blood in the stool or melena, unexplained pain, sweating, unexplained fever, unexplained loss of weight up to anorexia, changes in bowel habits (constipation and/or diarrhea), tenesmus (sense of incomplete defecation, for rectal cancer specifically), anemia and/or general weakness). Additionally or alternatively, the subject in need thereof can be a healthy human subject undergoing a routine well-being check up.

Analyzing presence of malignant or pre-malignant cells can be effected in vivo or ex vivo, whereby a biological sample (e.g., biopsy) is retrieved. Such biopsy samples comprise cells and may be an incisional or excisional biopsy. Alternatively, the cells may be retrieved from a complete resection.

While employing the present teachings, additional information may be gleaned pertaining to the determination of treatment regimen, treatment course and/or to the measurement of the severity of the disease.

As used herein the phrase “treatment regimen” refers to a treatment plan that specifies the type of treatment, dosage, schedule and/or duration of a treatment provided to a subject in need thereof (e.g., a subject diagnosed with a pathology). The selected treatment regimen can be an aggressive one which is expected to result in the best clinical outcome (e.g., complete cure of the pathology) or a more moderate one which may relieve symptoms of the pathology yet results in incomplete cure of the pathology. It will be appreciated that in certain cases the treatment regimen may be associated with some discomfort to the subject or adverse side effects (e.g., damage to healthy cells or tissue). The type of treatment can include a surgical intervention (e.g., removal of lesion, diseased cells, tissue, or organ), a cell replacement therapy, an administration of a therapeutic drug (e.g., receptor agonists, antagonists, hormones, chemotherapy agents) in a local or a systemic mode, an exposure to radiation therapy using an external source (e.g., external beam) and/or an internal source (e.g., brachytherapy) and/or any combination thereof. The dosage, schedule and duration of treatment can vary, depending on the severity of pathology and the selected type of treatment, and those of skill in the art are capable of adjusting the type of treatment with the dosage, schedule and duration of treatment.

A method of diagnosis is also provided. The method comprises detecting an expression level of a specific cancer-associated nucleic acid in a biological sample. The sample may be derived from a patient. Diagnosis of a specific cancer state in a patient may allow for prognosis and selection of therapeutic strategy. Further, the developmental stage of cells may be classified by determining temporally expressed specific cancer-associated nucleic acids.

In situ hybridization of labeled probes to tissue arrays may be performed. When comparing the fingerprints between individual samples the skilled artisan can make a diagnosis, a prognosis, or a prediction based on the findings. It is further understood that the nucleic acid sequences which indicate the diagnosis may differ from those which indicate the prognosis, and molecular profiling of the condition of the cells may lead to distinctions between responsive or refractory conditions or may be predictive of outcomes.

10. Kits

A kit is also provided and may comprise a nucleic acid described herein together with any or all of the following: assay reagents, buffers, probes and/or primers, and sterile saline or another pharmaceutically acceptable emulsion and suspension base. In addition, the kits may include instructional materials containing directions (e.g., protocols) for the practice of the methods described herein. The kit may further comprise a software package for data analysis of expression profiles.

For example, the kit may be a kit for the amplification, detection, identification or quantification of a target nucleic acid sequence. The kit may comprise a poly (T) primer, a forward primer, a reverse primer, and a probe.

Any of the compositions described herein may be comprised in a kit. In a non-limiting example, reagents for isolating miRNA, labeling miRNA, and/or evaluating a miRNA population using an array are included in a kit. The kit may further include reagents for creating or synthesizing miRNA probes. The kits will thus comprise, in suitable container means, an enzyme for labeling the miRNA by incorporating labeled nucleotide or unlabeled nucleotides that are subsequently labeled. It may also include one or more buffers, such as reaction buffer, labeling buffer, washing buffer, or a hybridization buffer, compounds for preparing the miRNA probes, components for in situ hybridization and components for isolating miRNA. Other kits of the invention may include components for making a nucleic acid array comprising miRNA, and thus may include, for example, a solid support.

The following examples are presented in order to more fully illustrate some embodiments of the invention. They should in no way be construed, however, as limiting the broad scope of the invention.

EXAMPLES

Methods

1. Tumor Samples

A total of 903 tumor samples were included in the study. These comprised 252 samples that were part of a preliminary study and 651 additional formalin-fixed paraffin-embedded (FFPE) samples. Tumor samples were obtained from several sources. Institutional review approvals were obtained for all samples in accordance with each institute's institutional review board or IRB equivalent guidelines. Samples included primary tumors and metastases of defined origins, according to clinical records. Tumor content was at least 50% for >95% of samples, as determined by a pathologist based on hematoxylin-eosin (H&E) stained slides. 204 of the 903 samples were used only in the validation phase, as an independent blinded test set. The reference diagnosis of these samples from the original clinical record was confirmed by an additional review of pathological specimens.

2. RNA Extraction

For FFPE samples, total RNA was isolated from seven to ten 10-μm-thick tissue sections using the miR extraction protocol developed at Rosetta Genomics. Briefly, the sample was incubated a few times in xylene at 57° C. to remove excess paraffin, followed by ethanol washes. Proteins were degraded by proteinase K solution at 45° C. for a few hours. The RNA was extracted with acid phenol:chloroform followed by ethanol precipitation and DNase digestion. Total RNA quantity and quality were checked by spectrophotometer (Nanodrop ND-1000).

3. miR Array Platform

Custom microarrays (Agilent Technologies, Santa Clara, Calif.) were produced by printing DNA oligonucleotide probes to more than 900 human microRNAs. Each probe, printed in triplicate, carried up to a 22-nucleotide (nt) linker at the 3′ end of the microRNA's complement sequence, in addition to an amine group used to couple the probes to coated glass slides. Twenty μM of each probe were dissolved in 2×SSC+0.0035% SDS and spotted in triplicate on Schott Nexterion® Slide E-coated microarray slides using a Genomic Solutions® BioRobotics MicroGrid II according to the MicroGrid manufacturer's directions. Fifty-four negative control probes were designed using the sense sequences of different microRNAs. Two groups of positive control probes were designed to hybridize to the miR array: (i) synthetic small RNAs were spiked into the RNA before labeling to verify the labeling efficiency; and (ii) probes for abundant small RNAs (e.g., small nuclear RNAs (U43, U49, U24, Z30, U6, U48, U44), 5.8S and 5S ribosomal RNA) were spotted on the array to verify RNA quality. The slides were blocked in a solution containing 50 mM ethanolamine, 1 M Tris (pH 9.0) and 0.1% SDS for 20 min at 50° C., then thoroughly rinsed with water and spun dry.

4. Cy-Dye Labeling of miRNA for miR Array

Five μg of total RNA were labeled by ligation (Thomson et al. Nature Methods 2004; 1:47-53) of an RNA-linker, p-rCrU-Cy/dye (Dharmacon), to the 3′ end with Cy3 or Cy5. The labeling reaction contained total RNA, spikes (0.1-20 fmoles), 300 ng RNA-linker-dye, 15% DMSO, 1× ligase buffer and 20 units of T4 RNA ligase (NEB), and proceeded at 4° C. for 1 h, followed by 1 h at 37° C. The labeled RNA was mixed with 3× hybridization buffer (Ambion), heated to 95° C. for 3 min and then added on top of the miR array. Slides were hybridized for 12-16 h at 42° C., followed by two washes at room temperature with 1×SSC and 0.2% SDS and a final wash with 0.1×SSC.

Arrays were scanned using an Agilent Microarray Scanner Bundle G2565BA (resolution of 10 μm at 100% power). Array images were analyzed using SpotReader software (Niles Scientific).

5. Array Signal Calculation and Normalization

Triplicate spots were combined to produce one signal for each probe by taking the logarithmic mean of reliable spots. All data were log-transformed (natural base) and the analysis was performed in log-space. A reference data vector for normalization R was calculated by taking the median expression level for each probe across all samples. For each sample data vector S, a 2nd degree polynomial F was found so as to provide the best fit between the sample data and the reference data, such that R≈F(S). Remote data points (“outliers”) were not used for fitting the polynomial F. For each probe in the sample (element S_(i) in the vector S), the normalized value (in log-space) M_(i) was calculated from the initial value S_(i) by transforming it with the polynomial function F, so that M_(i)=F(S_(i)). Data were translated back to linear-space (by taking the exponent). Using only the training set samples to generate the reference data vector did not affect the results.
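The normalization step described above (median reference vector R, second-degree polynomial F fitted so that R≈F(S), then M_(i)=F(S_(i))) can be sketched as follows. This is a minimal Python illustration using only the standard library, assuming ordinary least squares for the fit and log-space inputs. The exclusion of outliers is omitted because the text does not specify the criterion used, and all function names are hypothetical.

```python
from statistics import median


def reference_vector(samples):
    """Median log-expression of each probe across all samples (vector R)."""
    n_probes = len(samples[0])
    return [median(s[i] for s in samples) for i in range(n_probes)]


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - tail) / m[r][r]
    return x


def fit_quadratic(s, r):
    """Least-squares (c0, c1, c2) such that r ~ c0 + c1*s + c2*s**2."""
    power_sums = [sum(v ** k for v in s) for k in range(5)]
    a = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    b = [sum(rv * sv ** i for sv, rv in zip(s, r)) for i in range(3)]
    return solve3(a, b)


def normalize(sample, reference):
    """Normalized log-space values M_i = F(S_i) for one sample vector S."""
    c0, c1, c2 = fit_quadratic(sample, reference)
    return [c0 + c1 * v + c2 * v * v for v in sample]
```

Where NumPy is available, `numpy.polyfit(sample, reference, 2)` would perform an equivalent fit; the explicit solver is used here only to keep the sketch dependency-free.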

6. Logistic Regression

The aim of a logistic regression model is to use several features, such as expression levels of several microRNAs, to assign a probability of belonging to one of two possible groups, such as two branches of a node in a binary decision-tree. Logistic regression models the natural log of the odds, i.e., the ratio of the probability of belonging to the first group, for example, the left branch in a node of a binary decision-tree (P) over the probability of belonging to the second group, for example, the right branch in such a node (1−P), as a linear combination of the different expression levels (in log-space). The logistic regression assumes that:

${{\ln \left( \frac{P}{1 - P} \right)} = {{\beta_{0} + {\sum\limits_{i = 1}^{N}{\beta_{i} \cdot M_{i}}}} = {\beta_{0} + {\beta_{1} \cdot M_{1}} + {\beta_{2} \cdot M_{2}} + \ldots}}}\mspace{11mu},$

where β₀ is the bias, M_(i) is the expression level (normalized, in log-space) of the i-th microRNA used in the decision node, and β_(i) is its corresponding coefficient. β_(i)>0 indicates that the probability to take the left branch (P) increases when the expression level of this microRNA (M_(i)) increases, and the opposite for β_(i)<0. If a node uses only a single microRNA (M), then solving for P results in:

$P = \frac{e^{\beta_{0} + \beta_{1} \cdot M}}{1 + e^{\beta_{0} + \beta_{1} \cdot M}}.$
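As a worked illustration of the formula above, the following minimal Python sketch evaluates the left-branch probability P for a node, given a bias and one coefficient per microRNA. The function name `branch_probability` and the numeric values are hypothetical; they are not coefficients from the classifier described herein.

```python
import math


def branch_probability(bias, coefficients, expression):
    """Left-branch probability P for one sample at one node.

    `expression` holds the normalized log-space levels M_i of the microRNAs
    used at the node and `coefficients` the matching beta_i values.
    1 / (1 + exp(-z)) is algebraically equal to exp(z) / (1 + exp(z)).
    """
    z = bias + sum(b * m for b, m in zip(coefficients, expression))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical single-microRNA node: beta_0 = -2.0, beta_1 = 0.5, M = 4.0,
# so the linear term is zero and P is exactly one half.
p = branch_probability(-2.0, [0.5], [4.0])
```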

The regression error on each sample is the difference between the assigned probability P and the true “probability” of this sample, i.e., 1 if this sample is in the left branch group and 0 otherwise. The training and optimization of the logistic regression model calculates the parameters β and the p-values (for each microRNA by the Wald statistic and for the overall model by the χ² (chi-square) difference), maximizing the likelihood of the data given the model and minimizing the total regression error:

$\sum_{\text{samples } j \text{ in first group}} \left(1 - P_{j}\right) + \sum_{\text{samples } j \text{ in second group}} P_{j}.$

The probability output of the logistic model is here converted to a binary decision by comparing P to a threshold, denoted by P_(TH), i.e., if P>P_(TH) then the sample belongs to the left branch (“first group”) and vice versa. Choosing at each node the branch which has a probability>0.5, i.e., using a probability threshold of 0.5, leads to a minimization of the sum of the regression errors. However, as the goal was the minimization of the overall number of misclassifications (and not of their probability), a modification which adjusts the probability threshold (P_(TH)) was used in order to minimize the overall number of mistakes at each node (Table 3). For each node, the probability threshold P_(TH) was optimized such that the number of classification errors was minimized. This change of probability threshold is equivalent (in terms of classifications) to a modification of the bias β₀, which may reflect a change in the prior frequencies of the classes.
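A threshold search of the kind described above might look like the following sketch. The candidate set (0.5 plus the midpoints between consecutive observed probabilities), the tie-breaking rule, and the toy data are assumptions; the text does not specify how the search was carried out.

```python
def best_threshold(probs, is_left):
    """Return (threshold, errors) with the fewest misclassifications at one node."""
    values = sorted(set(probs))
    # Candidates: the default 0.5 and midpoints between neighbouring probabilities.
    candidates = [0.5] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]

    def errors(th):
        # A sample is assigned to the left branch when P > th.
        return sum((p > th) != left for p, left in zip(probs, is_left))

    # Fewest errors first; among ties prefer the threshold closest to 0.5.
    return min(((th, errors(th)) for th in candidates),
               key=lambda pair: (pair[1], abs(pair[0] - 0.5)))


# Toy node (hypothetical): two left-branch samples fall just below 0.5.
probs = [0.9, 0.8, 0.45, 0.4, 0.3, 0.1]
is_left = [True, True, True, True, False, False]
threshold, n_errors = best_threshold(probs, is_left)
errors_at_half = sum((p > 0.5) != left for p, left in zip(probs, is_left))
```

Because P > P_(TH) holds exactly when the log-odds exceed ln(P_(TH)/(1−P_(TH))), moving the threshold gives the same classifications as subtracting that constant from β₀ and keeping the threshold at 0.5.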

7. Stepwise Logistic Regression and Feature Selection

The original data contain the expression levels of multiple microRNAs for each sample, i.e., multiple data features. In training the classifier for each node, only a small subset of these features was selected and used for optimizing a logistic regression model. In the initial training this was done using a forward stepwise scheme. The features were sorted in order of decreasing log-likelihoods, and the logistic model was started off and optimized with the first feature. The second feature was then added, and the model re-optimized. The regression error of the two models was compared: if the addition of the feature did not provide a significant advantage (a χ² difference less than 7.88, p-value of 0.005), the new feature was discarded. Otherwise, the added feature was kept. Adding a new feature may make a previous feature redundant (e.g., if they are very highly correlated). To check for this, the process iteratively checks if the feature with lowest likelihood can be discarded (without losing χ² difference as above). After ensuring that the current set of features is compact in this sense, the process continues to test the next feature in the sorted list, until features are exhausted. No limitation on the number of features was inserted into the algorithm, but in most cases 2-3 features were selected.
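The stated cutoff can be checked numerically. The sketch below assumes that the χ² difference is the usual likelihood-ratio statistic (twice the gain in log-likelihood) with one degree of freedom per added feature; the function names and the example log-likelihoods are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def keep_feature(loglik_without, loglik_with, cutoff=7.88):
    """Keep the added feature only if the chi-square difference reaches the cutoff."""
    return 2.0 * (loglik_with - loglik_without) >= cutoff


p_at_cutoff = chi2_sf_1df(7.88)          # close to the quoted p-value of 0.005
kept = keep_feature(-50.0, -45.0)        # chi-square difference of 10.0
discarded = keep_feature(-50.0, -47.0)   # chi-square difference of 6.0
```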

The stepwise logistic regression method was used on subsets of the training set samples by re-sampling the training set with repetition (“bootstrap”), so that each of the 20 runs contained about two-thirds of the samples at least once, and any one sample had >99% chance of being left out at least once. This resulted in an average of 2-3 features per node (4-8 in more difficult nodes). A robust set of 2-3 features for each node was selected (Table 3) by comparing features that were repeatedly chosen in the bootstrap sets to previous evidence, and considering their signal strengths and reliability. When using these selected features to construct the classifier, the stepwise process was not used and the training optimized the logistic regression model parameters only.
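The two bootstrap figures quoted above follow from elementary probability. The sketch assumes that each resample draws as many samples as the training set contains (the usual bootstrap convention, not stated in the text) and uses a training set size of 400 for illustration.

```python
def bootstrap_coverage(n_samples, n_runs):
    """Chance that a sample appears in one resample, and that it is absent from at least one run."""
    p_absent = (1.0 - 1.0 / n_samples) ** n_samples   # left out of a single resample
    p_present = 1.0 - p_absent
    p_absent_at_least_once = 1.0 - p_present ** n_runs
    return p_present, p_absent_at_least_once


p_present, p_absent_once = bootstrap_coverage(n_samples=400, n_runs=20)
```

Under these assumptions each run contains roughly 63% of the samples at least once, and a given sample is left out of at least one of the 20 runs with a probability well above 99%.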

8. K-Nearest-Neighbors (KNN) Classification Algorithm

The KNN algorithm (see e.g., Ma et al., Arch Pathol Lab Med 2006; 130:465-73) calculates the distance (Pearson correlation) of any sample to all samples in the training set, and classifies the sample by the majority vote of the k samples which are most similar (k being a parameter of the classifier). The correlation is calculated on the pre-defined set of microRNAs (the 48 microRNAs that were used by the decision-tree). KNN algorithms with k from 1 to 10 were compared, and the optimal performer was selected, using k=7.
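A correlation-based nearest-neighbor vote can be sketched as below. The four-feature profiles, the labels and the value k=3 are toy assumptions chosen to keep the example small; the classifier described above uses 48 microRNAs and k=7.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def knn_predict(sample, training, k=7):
    """Majority vote among the k training profiles most correlated with the sample."""
    ranked = sorted(training, key=lambda item: pearson(sample, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical profiles and labels).
training = [
    ([1.0, 2.0, 3.0, 4.0], "liver"),
    ([2.0, 3.0, 4.0, 6.0], "liver"),
    ([1.0, 3.0, 3.0, 5.0], "liver"),
    ([4.0, 3.0, 2.0, 1.0], "kidney"),
    ([5.0, 3.0, 2.0, 2.0], "kidney"),
    ([6.0, 4.0, 2.0, 1.0], "kidney"),
]
prediction = knn_predict([1.5, 2.5, 3.5, 5.0], training, k=3)
```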

9. qRT-PCR

Total RNA (1 μg) was subjected to a polyadenylation reaction as described before (Gilad et al., PLoS ONE 2008; 3:e3148). Briefly, RNA was incubated in the presence of poly(A) polymerase (PAP) (Takara-2180A), MnCl2, and ATP for 1 h at 37° C. Reverse transcription was performed on the total RNA. An oligo-dT primer harboring a consensus sequence (complementary to the reverse primer), an oligo-dT stretch, a V nucleotide (a mixture of A, C, and G) and an N nucleotide (a mixture of all four nucleotides) was used for the reverse transcription reaction. The primer was first annealed to the polyA-RNA and then subjected to a reverse transcription reaction with SuperScript II RT (Invitrogen). The cDNA was then amplified by a real-time PCR reaction, using a microRNA-specific forward primer, a TaqMan probe and a universal reverse primer that is complementary to the 3′ sequence of the oligo-dT tail. The reactions were incubated for 10 min at 95° C., followed by 42 cycles of 95° C. for 15 s and 60° C. for 1 min. qRT-PCR was performed using probes for the 104 candidate microRNAs, of which 5 were tested with two different forward primers, and for U6 snoRNA.

10. Feature Selection and Training

Training samples were retained if their average C_(t) was below 36 and at least 30 microRNAs were detected (C_(t)<38). Each sample was normalized by subtracting from the C_(t) of each microRNA the average C_(t) of all microRNAs of the sample, and adding back a scaling constant (the average C_(t) over the entire sample set). Feature selection and classifier training used the scaled C_(t) as the input signal. The feature selection resulted in a set of 48 microRNAs. The decision-tree (FIG. 1) used logistic regression on combinations of two-to-three microRNAs in each node to make binary decisions. The KNN was based on comparing the expression of all 48 microRNAs in each sample to all other samples in the training database. The decision-tree and KNN each return a predicted tissue of origin and histological type where applicable. The classifier returns the two different predictions or a single consensus prediction if the predictions concur. When the decision-tree and KNN predict different histological types of the same tissue of origin, the tissue of origin is returned as a consensus prediction with no histological type indicated.
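The C_(t) rescaling step described above amounts to a per-sample mean shift. In the sketch below the function name, the three C_(t) values and the scaling constant of 29 are illustrative assumptions.

```python
def rescale_ct(sample_ct, scaling_constant):
    """Subtract the sample's mean Ct from every microRNA and add back a constant."""
    mean_ct = sum(sample_ct.values()) / len(sample_ct)
    return {mir: ct - mean_ct + scaling_constant for mir, ct in sample_ct.items()}


# Hypothetical sample with three measured microRNAs.
sample_ct = {"hsa-miR-122": 24.0, "hsa-miR-200c": 30.0, "hsa-miR-21": 27.0}
rescaled = rescale_ct(sample_ct, scaling_constant=29.0)
rescaled_mean = sum(rescaled.values()) / len(rescaled)
```

After rescaling, every sample has the same mean C_(t) (the scaling constant), so differences between samples reflect relative rather than absolute signal levels.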

11. Test Protocol

RNA was extracted in batches together with a negative control. The negative control was a no-RNA sample that served to detect potential contaminations, and should not give any signal in the PCR reaction. The extracted RNA, together with a positive control sample, underwent cDNA preparation and 48 microRNAs were measured by qRT-PCR in duplicate in one 96-well plate per sample. The positive control was a specific RNA sample that should meet defined C_(t) ranges in the assay. Quality assessment of each well was based on the fluorescence amplification curve, using thresholds on the maximal fluorescence and the linear slope as a function of the measured C_(t). For each microRNA, C_(t) ^(miR) was calculated by taking the average C_(t) of the two repeats. Quality assessment for each sample was based on the number and identity of expressed microRNAs (C_(t)<38) and the average C_(t) of the measured microRNAs. C_(t) ^(miR) values for each sample were normalized by rescaling as described above. The rescaled values were used as input to the classifier that was trained using qRT-PCR data (as described above).

Example 1 Samples and Profiling

A discovery process that profiled hundreds of samples on the array platforms was performed to identify candidate biomarkers. A training set of ~400 FFPE samples was used. RNA was extracted from these samples and qRT-PCR was performed. An assay was constructed using 48 microRNAs (Table 3; FIGS. 1-7), to differentiate between 26 classes representing 18 tissue origins. An alternative assay was constructed, which does not identify bladder as an origin, i.e., differentiates between 25 classes representing 17 tissue origins.

A validation set of 255 new FFPE tumor samples was used to assess the performance of the assay, representing 26 different tumor origins or “classes” (see Table 2 for a summary of samples). About half of the samples in the set were metastatic tumors to different sites (e.g., lung, bone, brain and liver). Tumor percentage was at least 50% for all samples in the set.

TABLE 2 Cancer types, classes and histology

| Class | Cancer type | Histological classifications |
|---|---|---|
| 1 | bladder | transitional cell carcinoma |
| 2 | biliary tract | cholangiocarcinoma, gallbladder adenocarcinoma |
| 3 | brain-astrocytoma | astrocytic tumor; astrocytic tumor, anaplastic astrocytoma; astrocytic tumor, glioblastoma multiforme |
| 4 | brain-oligodendroglioma | oligodendroglial tumor, anaplastic oligodendroglioma; oligodendroglial tumor, oligodendroglioma |
| 5 | breast | adenocarcinoma; invasive ductal carcinoma |
| 6 | colon | adenocarcinoma |
| 7 | esophagus-squamous | squamous cell carcinoma |
| 8 | esophagus-stomach | esophagus adenocarcinoma; stomach adenocarcinoma |
| 9 | head & neck | squamous cell carcinoma of the larynx, pharynx and nose |
| 10 | kidney | renal cell carcinoma; clear cell carcinoma |
| 11 | liver | hepatocellular carcinoma |
| 12 | lung-carcinoid | neuroendocrine, carcinoid |
| 13 | lung-squamous | NSCLC, squamous cell carcinoma |
| 14 | lung-adeno-large | non-small, adenocarcinoma; non-small, large cell carcinoma |
| 15 | lung-small | neuroendocrine, small |
| 16 | melanoma | malignant melanoma |
| 17 | ovary-serous | ovary serous adenocarcinoma |
| 18 | ovary-endometrioid | ovary endometrioid adenocarcinoma |
| 19 | pancreas | adenocarcinoma |
| 20 | prostate | adenocarcinoma |
| 21 | testis-seminoma | GCT; seminoma |
| 22 | testis-non-seminoma | GCT; non-seminoma |
| 23 | thymus | thymoma - type B2; thymoma - type B3 |
| 24 | thyroid-follicular | follicular carcinoma |
| 25 | thyroid-medullary | neuroendocrine; medullary |
| 26 | thyroid-papillary | papillary carcinoma; tall cell |

Example 2 Decision-Tree Classification Algorithm

A tumor classifier was built using the microRNA expression levels by applying a binary tree classification scheme (FIG. 1). This framework is set up to utilize the specificity of microRNAs in tissue differentiation and embryogenesis: different microRNAs are involved in various stages of tissue specification, and are used by the algorithm at different decision points or “nodes”. The tree breaks up the complex multi-tissue classification problem into a set of simpler binary decisions. At each node, classes which branch out earlier in the tree are not considered, reducing interference from irrelevant samples and further simplifying the decision. The decision at each node can then be accomplished using only a small number of microRNA biomarkers, which have well-defined roles in the classification (Table 3). The structure of the binary tree was based on a hierarchy of tissue development and morphological similarity¹⁸, which was modified by prominent features of the microRNA expression patterns. For example, the expression patterns of microRNAs indicated a significant difference between liver-cholangio tumors and tumors of non-liver origin, and these are therefore separated at node #1 (FIG. 2) into separate branches (FIG. 1).

For each of the individual nodes, logistic regression models were used. These are a robust family of classifiers which are frequently used in epidemiological and clinical studies to combine continuous data features into a binary decision (FIGS. 2-7 and Methods). Since gene expression classifiers have an inherent redundancy in selecting the gene features, we used bootstrapping on the training sample set as a method to select a stable microRNA set for each node (Methods). This resulted in a small number (usually 2-3) of microRNA features per node, totaling 48 microRNAs for the full classifier (Table 3). This approach provides a systematic process for identifying new biomarkers for differential expression.

Example 3 Defining High-Confidence Classifications

In clinical practice it is often useful to assess information of different degrees of confidence^(17,18). In the diagnosis of tumor origin in particular, a short list of highly probable possibilities is a practical option when no definite diagnosis can be made. Since the decision-tree and the KNN algorithms are designed differently and trained independently, improved accuracy and greater confidence can be obtained by combining and comparing their classifications. When the two classifiers agree, the diagnosis is considered high-confidence and a single origin is identified. When the two disagree, the classification is made with low-confidence and two origins are suggested. Sensitivity of the union refers to the percentage in which at least one of the classifiers (Tree and KNN) was correct.
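The combination rule can be written as a short function. This sketch represents each prediction as a (tissue, histological type) pair, which is an assumed data layout; the handling of differing histological types follows the rule given in the Methods (same tissue of origin, no histological type indicated).

```python
def combine(tree_pred, knn_pred):
    """Merge decision-tree and KNN predictions into a high- or low-confidence call."""
    tree_tissue, tree_hist = tree_pred
    knn_tissue, knn_hist = knn_pred
    if tree_tissue == knn_tissue:
        # Same tissue of origin: report the histological type only if both agree on it.
        hist = tree_hist if tree_hist == knn_hist else None
        return {"confidence": "high", "predictions": [(tree_tissue, hist)]}
    # Different tissues: report both origins.
    return {"confidence": "low", "predictions": [tree_pred, knn_pred]}


agree = combine(("liver", None), ("liver", None))
same_tissue = combine(("lung", "squamous"), ("lung", "adeno-large"))
disagree = combine(("colon", None), ("pancreas", None))
```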

Example 4 Performance of the Test in Blinded Validation

The test performance was assessed using an independent set of 204 validation samples. These archival samples included primary as well as metastatic tumor samples, preserved as FFPE blocks, whose original clinical diagnosis (“reference diagnosis”) was one of the origins on which the classifier was trained. The samples were processed by personnel who were blinded to the original reference diagnosis for these samples, and classifications were automatically generated by dedicated software. 16 of the 204 samples (8%) failed QA criteria. For 188 samples (92%), including 87 metastatic tumor samples (46% of the samples), the test was completed successfully and produced tissue-of-origin predictions. For 159 of these samples (84%), the reference diagnosis for tissue of origin was predicted by at least one of the two classifiers (Table 4). For 124 samples (66%), the two classifiers agreed, generating a consensus prediction for a single tissue-of-origin. For these single-prediction cases, the sensitivity (positive agreement) was 90% (111/124 of the classifications agreed with the reference diagnosis), and it exceeded 90% for most tissue-types. Specificity (negative agreement) in this group ranged from 94% to 100%.
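The summary percentages in this paragraph can be recomputed from the sample counts it reports; the two-decimal values agree with the "Total" row of Table 4. The variable names below are illustrative.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


total, failed_qa = 204, 16
completed = total - failed_qa                           # 188 samples with predictions

union_sensitivity = percent(159, completed)             # at least one classifier correct
high_confidence_fraction = percent(124, completed)      # both classifiers agree
high_confidence_sensitivity = percent(111, 124)         # agreement with reference diagnosis
```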

FFPE sections from 73 of the validation samples were processed independently and blindly in a second laboratory. Data and classifications for these samples were compared between the two laboratories. The mean correlation for the qRT-PCR signals was 0.979 (4 samples had correlation coefficients between 0.91 and 0.95, all other correlations were greater than 0.95). The two labs disagreed on only 4 samples. For another 8, they had one of two answers in common and for the remaining 61, classifications matched perfectly between the two laboratories, demonstrating the precision of the test.

TABLE 3 Nodes of the decision-tree and microRNAs (# SEQ ID NO.) used in each node

| Node Num | Left Node Num Or Class | Right Node Num Or Class | Node miR 1 | Node miR 2 | Node miR 3 | Node Beta 0 | Node Beta 1 | Node Beta 2 | Node Beta 3 | Node All Classes Left |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | hsa-miR-200c (#26) | hsa-miR-122 (#6) | - | 9.11E+01 | 4.42E+00 | −8.39E+00 | NaN | biliary tract carcinoma, liver |
| 2 | liver | biliary tract carcinoma | hsa-miR-200b (#25) | hsa-miR-126 (#9) | - | −3.10E+03 | 6.76E+01 | 2.48E+01 | NaN | liver |
| 3 | 4 | 5 | hsa-miR-372 (#41) | - | - | 5.34E+02 | −1.56E+01 | NaN | NaN | testis-non-seminoma, testis-seminoma |
| 4 | testis-non-seminoma | testis-seminoma | hsa-miR-451 (#45) | hsa-miR-221 (#31) | hsa-miR-92b (#48) | −6.13E+02 | −2.10E+01 | −1.59E+01 | 5.68E+01 | testis-non-seminoma |
| 5 | 9 | 6 | hsa-miR-148b (#17) | hsa-miR-200c (#26) | - | 1.18E+02 | 1.63E+00 | −5.26E+00 | NaN | biliary tract carcinoma, bladder, breast, colon, esophagus-squamous, head_neck, lung-adeno large, lung-carcinoid, lung-small_cell, lung-squamous, ovary-endometrioid, ovary-serous, pancreas, prostate, stomach/esophagus-adeno, thymus, thyroid-follicular, thyroid-medullary, thyroid-papillary |
| 6 | melanoma | 7 | hsa-miR-497 (#46) | hsa-miR-146a (#15) | - | −6.61E+02 | 3.58E+01 | −1.54E+01 | NaN | melanoma |
| 7 | 8 | kidney | hsa-miR-9* (#47) | hsa-miR-124 (#7) | - | 6.66E+02 | −1.02E+01 | −8.88E+00 | NaN | brain-astrocytoma, brain-oligodendroglioma |
| 8 | brain-astrocytoma | brain-oligodendroglioma | hsa-miR-497 (#46) | hsa-miR-128 (#10) | - | −3.99E+03 | 7.04E+01 | 5.75E+01 | NaN | brain-astrocytoma |
| 9 | 10 | 12 | hsa-miR-15b (#19) | hsa-miR-152 (#18) | hsa-miR-375 (#42) | 2.56E+01 | −1.20E+00 | 1.29E+00 | −1.22E+00 | lung-carcinoid, lung-small cell, thyroid-medullary |
| 10 | 11 | thyroid-medullary | hsa-miR-222 (#32) | hsa-miR-200a (#24) | - | −3.52E+02 | 2.97E+01 | −1.89E+01 | NaN | lung-carcinoid, lung-small cell |
| 11 | lung-carcinoid | lung-small cell | hsa-miR-106a (#3) | hsa-miR-29c (#36) | - | 2.53E+02 | 2.36E+01 | −3.12E+01 | NaN | lung-carcinoid |
| 12 | 13 | 16 | hsa-miR-106a (#3) | hsa-miR-192 (#21) | - | 7.25E+01 | 3.73E−01 | −2.38E+00 | NaN | Biliary tract carcinoma, colon, pancreas, stomach/esophagus-adeno |
| 13 | 14 | 15 | hsa-miR-21 (#29) | hsa-let-7b (#1) | hsa-miR-30a (#37) | −1.40E+02 | 1.90E−01 | 3.33E+00 | 1.18E+00 | colon, stomach/esophagus-adeno |
| 14 | colon | Stomach esophagus-adeno | hsa-miR-10a (#4) | hsa-miR-92b (#48) | hsa-miR-29a-fw18 (#34) | 9.31E+02 | −1.18E+01 | 1.32E+01 | −3.37E+01 | colon |
| 15 | pancreas | biliary tract carcinoma | hsa-miR-25 (#33) | hsa-miR-200c (#26) | hsa-miR-20a-fw18 (#28) | −2.06E+02 | 9.37E+00 | −6.45E+00 | 3.10E+00 | pancreas |
| 16 | prostate | 17 | hsa-miR-185 (#20) | hsa-miR-375 (#42) | - | 2.68E+02 | 8.84E+00 | −2.10E+01 | NaN | prostate |
| 17 | 18 | 19 | hsa-miR-10b-fw18 (#5) | hsa-miR-130a (#11) | hsa-miR-210 (#30) | 1.35E+02 | −1.53E+00 | −1.63E+00 | −8.83E−01 | ovary-endometrioid, ovary-serous |
| 18 | ovary-endometrioid | ovary-serous | hsa-miR-148b (#17) | hsa-miR-193a-3p (#22) | hsa-let-7f (#2) | −3.81E+02 | 3.64E+00 | 4.26E+00 | 3.38E+00 | ovary-endometrioid |
| 19 | 20 | breast | hsa-miR-193a-3p (#22) | hsa-miR-342-3p (#39) | - | −1.32E+02 | 2.26E+00 | 1.66E+00 | NaN | bladder, esophagus-squamous, head/neck, lung-adeno large, lung-squamous, thymus, thyroid-follicular, thyroid-papillary |
| 20 | 23 | 21 | hsa-miR-205 (#27) | hsa-miR-146b-5p (#16) | - | −2.02E+01 | −1.22E+00 | 1.82E+00 | NaN | bladder, esophagus-squamous, head/neck, lung-squamous, thymus |
| 21 | lung-adeno large | 22 | hsa-miR-125b (#8) | hsa-miR-30a (#37) | - | −3.03E+03 | 3.93E+01 | 6.02E+01 | NaN | lung-adeno large |
| 22 | thyroid-follicular | thyroid-papillary | hsa-miR-31 (#38) | hsa-miR-21 (#29) | - | −1.53E+03 | 2.78E+01 | 2.17E+01 | NaN | thyroid-follicular |
| 23 | 24 | thymus | hsa-miR-29b (#35) | hsa-miR-21 (#29) | - | −9.24E+01 | 5.39E+00 | −2.98E+00 | NaN | bladder, esophagus-squamous, head/neck, lung-squamous |
| 24 | 25 | bladder | hsa-miR-425 (#44) | hsa-miR-10a (#4) | hsa-miR-375 (#42) | −9.03E+01 | 2.70E+00 | 1.79E+00 | −1.73E+00 | esophagus-squamous, head/neck, lung-squamous |
| 25 | lung-squamous | 26 | hsa-miR-10a (#4) | hsa-miR-19b (#23) | hsa-miR-222 (#32) | 2.52E+02 | −2.10E+00 | −3.19E+00 | −2.50E+00 | lung-squamous |
| 26 | esophagus-squamous | head/neck | hsa-miR-143 (#14) | hsa-miR-451 (#45) | hsa-miR-30a (#37) | −1.32E+02 | −1.75E+01 | 4.47E+00 | 1.53E+01 | esophagus-squamous |

Node Num: The number of the node (1-26)
Left Node Num Or Class: Left branch - the node number or the class reached
Right Node Num Or Class: Right branch - the node number or the class reached
Node miR1: miRs used in node - #1
Node miR2: miRs used in node - #2 (could be empty)
Node miR3: miRs used in node - #3 (could be empty)
Node Beta 0: The value of the beta0 (intercept)
Node Beta 1: The value of the beta1, corresponding to nodeMir1
Node Beta 2: The value of the beta2, corresponding to nodeMir2 - could be NaN (empty)
Node Beta 3: The value of the beta3, corresponding to nodeMir3 - could be NaN (empty)
Node All Classes Left: A list of all the classes that are on the left branch
Node All Classes Right: A list of all the classes that are on the right branch
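To illustrate how the entries of Table 3 drive a classification, the sketch below walks two of the nodes (1 and 2) using their tabulated coefficients. Several points are assumptions: the input signal is taken to be the rescaled C_(t) of each microRNA, a probability threshold of 0.5 is used at every node because the optimized per-node thresholds are not reproduced here, and the signal values are invented for the example. Nodes other than 1 and 2 are not modelled, so the walk stops when it reaches one of them.

```python
import math

# Nodes 1 and 2 transcribed from Table 3; all other nodes are omitted in this sketch.
NODES = {
    1: {"left": 2, "right": 3, "beta0": 91.1,
        "terms": [("hsa-miR-200c", 4.42), ("hsa-miR-122", -8.39)]},
    2: {"left": "liver", "right": "biliary tract carcinoma", "beta0": -3100.0,
        "terms": [("hsa-miR-200b", 67.6), ("hsa-miR-126", 24.8)]},
}


def p_left(node, signal):
    """Probability of taking the left branch at one node."""
    z = node["beta0"] + sum(beta * signal[mir] for mir, beta in node["terms"])
    z = max(-700.0, min(700.0, z))        # clamp so that exp() stays in range
    return 1.0 / (1.0 + math.exp(-z))


def classify(signal, start=1, threshold=0.5):
    """Follow branches until a class name or a node missing from NODES is reached."""
    current, path = start, []
    while current in NODES:
        node = NODES[current]
        branch = "left" if p_left(node, signal) > threshold else "right"
        path.append((current, branch))
        current = node[branch]
    return current, path


# Hypothetical rescaled Ct values for two samples.
liver_like = {"hsa-miR-200c": 35.0, "hsa-miR-122": 22.0,
              "hsa-miR-200b": 35.0, "hsa-miR-126": 30.0}
other = {"hsa-miR-200c": 30.0, "hsa-miR-122": 36.0,
         "hsa-miR-200b": 30.0, "hsa-miR-126": 28.0}
liver_result = classify(liver_like)
other_result = classify(other)
```

For the second sample the walk ends at node 3, which this sketch does not include; a full implementation would continue through the remaining nodes of the table.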

TABLE 4 Performance of the test in blinded validation

| Class | Successful samples in test set | Sensitivity of union prediction (%) | Specificity of union prediction (%) | Fraction in high confidence (%) | Sensitivity of high confidence (%) | Specificity of high confidence (%) |
|---|---|---|---|---|---|---|
| Biliary tract | 6 | 66.67 | 93.96 | 33.33 | 100 | 98.36 |
| Brain | 10 | 100 | 100 | 80 | 100 | 100 |
| Breast | 33 | 66.67 | 93.55 | 45.45 | 53.33 | 100 |
| Colon | 9 | 88.89 | 94.41 | 66.67 | 83.33 | 99.15 |
| Esophagus | 1 | 100 | 98.4 | 0 | NaN | 100 |
| Neck & Head | 3 | 100 | 92.43 | 100 | 100 | 97.52 |
| Kidney | 8 | 87.5 | 99.44 | 62.5 | 80 | 100 |
| Liver | 8 | 100 | 99.44 | 100 | 100 | 100 |
| Lung | 23 | 91.3 | 84.85 | 86.96 | 95 | 94.23 |
| Melanocyte | 7 | 85.71 | 97.79 | 85.71 | 83.33 | 100 |
| Ovary | 13 | 84.62 | 100 | 38.46 | 100 | 100 |
| Pancreas | 6 | 50 | 97.8 | 16.67 | 100 | 99.19 |
| Prostate | 19 | 89.47 | 99.41 | 57.89 | 100 | 100 |
| Stomach or esophagus | 5 | 40 | 98.91 | 40 | 50 | 100 |
| Testis | 7 | 100 | 100 | 100 | 100 | 100 |
| Thymus | 6 | 83.33 | 97.8 | 83.33 | 80 | 100 |
| Thyroid | 24 | 100 | 98.17 | 83.33 | 100 | 100 |
| Total | 188 | 84.57 | 96.91 | 65.96 | 89.52 | 99.34 |

Example 5 Classification Example

One of the training-set samples, originally diagnosed in the clinical setting as a metastatic tumor to the brain originating from the lung, was classified by the tree (in leave-one-out cross-validation) as originating from the liver. This classification was traced back to node #1, the branching point where lung and liver origins diverge (FIG. 1). This node uses hsa-miR-122 (SEQ ID NO: 6), together with hsa-miR-200c (SEQ ID NO: 26). The expression of these microRNAs in this sample, in particular the very high expression of hsa-miR-122 (FIG. 8A), is a strong indicator of a possible hepatic origin of this sample. Upon re-examination of the clinical record, it was found that this sample was originally classified as a lung metastasis based on the fact that the patient had a known mass in the lung. This disagreement between the original clinical diagnosis and our test was followed up by blinded pathological review. Indeed, the sample's immunohistochemical staining pattern was incompatible with lung adenocarcinoma origin, but was consistent with a diagnosis of hepatocellular carcinoma (FIG. 8B). Thus, the test could suggest an alternative diagnosis for this patient, namely a primary hepatocellular carcinoma with metastatic spread to both lung and brain.

Example 6 Variant microRNAs

For some of the microRNAs in Table 3, other variant microRNAs having a similar seed sequence (identical nucleotides 2-8) are known in the human genome (see Table 5), and are therefore considered to target a very similar set of (mRNA-coding) genes (via the RISC machinery). These microRNAs with identical seed sequence may be substituted for the indicated miRs.

TABLE 5 microRNAs with identical seed sequence

| Indicated miRs | Seed | miRs with same seed | miR sequence | SEQ ID NO: |
|---|---|---|---|---|
| hsa-let-7b | GAGGTAG | hsa-let-7d | AGAGGTAGTAGGTTGCATAGTT | 152 |
| hsa-let-7b | GAGGTAG | hsa-let-7e | TGAGGTAGGAGGTTGTATAGTT | 153 |
| hsa-let-7b | GAGGTAG | hsa-miR-98 | TGAGGTAGTAAGTTGTATTGTT | 154 |
| hsa-let-7b | GAGGTAG | hsa-let-7f | TGAGGTAGTAGATTGTATAGTT | 2 |
| hsa-let-7b | GAGGTAG | hsa-let-7a | TGAGGTAGTAGGTTGTATAGTT | 155 |
| hsa-let-7b | GAGGTAG | hsa-let-7c | TGAGGTAGTAGGTTGTATGGTT | 156 |
| hsa-let-7b | GAGGTAG | hsa-let-7g | TGAGGTAGTAGTTTGTACAGTT | 157 |
| hsa-let-7b | GAGGTAG | hsa-let-7i | TGAGGTAGTAGTTTGTGCTGTT | 158 |
| hsa-let-7f | GAGGTAG | hsa-let-7d | AGAGGTAGTAGGTTGCATAGTT | 152 |
| hsa-let-7f | GAGGTAG | hsa-let-7e | TGAGGTAGGAGGTTGTATAGTT | 153 |
| hsa-let-7f | GAGGTAG | hsa-miR-98 | TGAGGTAGTAAGTTGTATTGTT | 154 |
| hsa-let-7f | GAGGTAG | hsa-let-7c | TGAGGTAGTAGGTTGTATGGTT | 156 |
| hsa-let-7f | GAGGTAG | hsa-let-7b | TGAGGTAGTAGGTTGTGTGGTT | 1 |
| hsa-let-7f | GAGGTAG | hsa-let-7g | TGAGGTAGTAGTTTGTACAGTT | 157 |
| hsa-let-7f | GAGGTAG | hsa-let-7i | TGAGGTAGTAGTTTGTGCTGTT | 158 |
| hsa-miR-106a | AAAGTGC | hsa-miR-519d | CAAAGTGCCTCCCTTTAGAGTG | 159 |
| hsa-miR-106a | AAAGTGC | hsa-miR-20b | CAAAGTGCTCATAGTGCAGGTAG | 160 |
| hsa-miR-106a | AAAGTGC | hsa-miR-93 | CAAAGTGCTGTTCGTGCAGGTAG | 161 |
| hsa-miR-106a | AAAGTGC | hsa-miR-17 | CAAAGTGCTTACAGTGCAGGTAG | 162 |
| hsa-miR-106a | AAAGTGC | hsa-miR-526b* | GAAAGTGCTTCCTTTTAGAGGC | 163 |
| hsa-miR-106a | AAAGTGC | hsa-miR-106b | TAAAGTGCTGACAGTGCAGAT | 164 |
| hsa-miR-106a | AAAGTGC | hsa-miR-20a | TAAAGTGCTTATAGTGCAGGTAG | 28 |
| hsa-miR-10a | ACCCTGT | hsa-miR-10b | TACCCTGTAGAACCGAATTTGTG | 165 |
| hsa-miR-10b | ACCCTGT | hsa-miR-10a | TACCCTGTAGATCCGAATTTGTG | 4 |
| hsa-miR-124 | AAGGCAC | hsa-miR-506 | TAAGGCACCCTTCTGAGTAGA | 166 |
| hsa-miR-125b | CCCTGAG | hsa-miR-125a-5p | TCCCTGAGACCCTTTAACCTGTGA | 167 |
| hsa-miR-130a | AGTGCAA | hsa-miR-301a | CAGTGCAATAGTATTGTCAAAGC | 168 |
| hsa-miR-130a | AGTGCAA | hsa-miR-301b | CAGTGCAATGATATTGTCAAAGC | 169 |
| hsa-miR-130a | AGTGCAA | hsa-miR-130b | CAGTGCAATGATGAAAGGGCAT | 170 |
| hsa-miR-130a | AGTGCAA | hsa-miR-454 | TAGTGCAATATTGCTTATAGGGT | 171 |
| hsa-miR-146a | GAGAACT | hsa-miR-146b-5p | TGAGAACTGAATTCCATAGGCT | 16 |
| hsa-miR-146b-5p | GAGAACT | hsa-miR-146a | TGAGAACTGAATTCCATGGGTT | 15 |
| hsa-miR-148b | CAGTGCA | hsa-miR-148a | TCAGTGCACTACAGAACTTTGT | 172 |
| hsa-miR-148b | CAGTGCA | hsa-miR-152 | TCAGTGCATGACAGAACTTGG | 18 |
| hsa-miR-152 | CAGTGCA | hsa-miR-148a | TCAGTGCACTACAGAACTTTGT | 172 |
| hsa-miR-152 | CAGTGCA | hsa-miR-148b | TCAGTGCATCACAGAACTTTGT | 17 |
| hsa-miR-15b | AGCAGCA | hsa-miR-424 | CAGCAGCAATTCATGTTTTGAA | 173 |
| hsa-miR-15b | AGCAGCA | hsa-miR-497 | CAGCAGCACACTGTGGTTTGT | 46 |
| hsa-miR-15b | AGCAGCA | hsa-miR-195 | TAGCAGCACAGAAATATTGGC | 174 |
| hsa-miR-15b | AGCAGCA | hsa-miR-15a | TAGCAGCACATAATGGTTTGTG | 175 |
| hsa-miR-15b | AGCAGCA | hsa-miR-16 | TAGCAGCACGTAAATATTGGCG | 176 |
| hsa-miR-192 | TGACCTA | hsa-miR-215 | ATGACCTATGAATTGACAGAC | 177 |
| hsa-miR-193a-3p | ACTGGCC | hsa-miR-193b | AACTGGCCCTCAAAGTCCCGCT | 178 |
| hsa-miR-19b | GTGCAAA | hsa-miR-19a | TGTGCAAATCTATGCAAAACTGA | 179 |
| hsa-miR-200a | AACACTG | hsa-miR-141 | TAACACTGTCTGGTAAAGATGG | 180 |
| hsa-miR-200b | AATACTG | hsa-miR-200c | TAATACTGCCGGGTAATGATGGA | 26 |
| hsa-miR-200b | AATACTG | hsa-miR-429 | TAATACTGTCTGGTAAAACCGT | 181 |
| hsa-miR-200c | AATACTG | hsa-miR-200b | TAATACTGCCTGGTAATGATGA | 25 |
| hsa-miR-200c | AATACTG | hsa-miR-429 | TAATACTGTCTGGTAAAACCGT | 181 |
| hsa-miR-20a | AAAGTGC | hsa-miR-106a | AAAAGTGCTTACAGTGCAGGTAG | 3 |
| hsa-miR-20a | AAAGTGC | hsa-miR-519d | CAAAGTGCCTCCCTTTAGAGTG | 159 |
| hsa-miR-20a | AAAGTGC | hsa-miR-20b | CAAAGTGCTCATAGTGCAGGTAG | 160 |
| hsa-miR-20a | AAAGTGC | hsa-miR-93 | CAAAGTGCTGTTCGTGCAGGTAG | 161 |
| hsa-miR-20a | AAAGTGC | hsa-miR-17 | CAAAGTGCTTACAGTGCAGGTAG | 162 |
| hsa-miR-20a | AAAGTGC | hsa-miR-526b* | GAAAGTGCTTCCTTTTAGAGGC | 163 |
| hsa-miR-20a | AAAGTGC | hsa-miR-106b | TAAAGTGCTGACAGTGCAGAT | 164 |
| hsa-miR-21 | AGCTTAT | hsa-miR-590-5p | GAGCTTATTCATAAAAGTGCAG | 182 |
| hsa-miR-221 | GCTACAT | hsa-miR-222 | AGCTACATCTGGCTACTGGGT | 32 |
| hsa-miR-222 | GCTACAT | hsa-miR-221 | AGCTACATTGTCTGCTGGGTTTC | 31 |
| hsa-miR-25 | ATTGCAC | hsa-miR-363 | AATTGCACGGTATCCATCTGTA | 184 |
| hsa-miR-25 | ATTGCAC | hsa-miR-367 | AATTGCACTTTAGCAATGGTGA | 185 |
| hsa-miR-25 | ATTGCAC | hsa-miR-32 | TATTGCACATTACTAAGTTGCA | 186 |
| hsa-miR-25 | ATTGCAC | hsa-miR-92b | TATTGCACTCGTCCCGGCCTCC | 48 |
| hsa-miR-25 | ATTGCAC | hsa-miR-92a | TATTGCACTTGTCCCGGCCTGT | 187 |
| hsa-miR-29a | AGCACCA | hsa-miR-29b | TAGCACCATTTGAAATCAGTGTT | 35 |
| hsa-miR-29a | AGCACCA | hsa-miR-29c | TAGCACCATTTGAAATCGGTTA | 36 |
| hsa-miR-29b | AGCACCA | hsa-miR-29a | TAGCACCATCTGAAATCGGTTA | 34 |
| hsa-miR-29b | AGCACCA | hsa-miR-29c | TAGCACCATTTGAAATCGGTTA | 36 |
| hsa-miR-29c | AGCACCA | hsa-miR-29a | TAGCACCATCTGAAATCGGTTA | 34 |
| hsa-miR-29c | AGCACCA | hsa-miR-29b | TAGCACCATTTGAAATCAGTGTT | 35 |
| hsa-miR-30a | GTAAACA | hsa-miR-30d | TGTAAACATCCCCGACTGGAAG | 188 |
| hsa-miR-30a | GTAAACA | hsa-miR-30b | TGTAAACATCCTACACTCAGCT | 189 |
| hsa-miR-30a | GTAAACA | hsa-miR-30c | TGTAAACATCCTACACTCTCAGC | 190 |
| hsa-miR-30a | GTAAACA | hsa-miR-30e | TGTAAACATCCTTGACTGGAAG | 191 |
| hsa-miR-372 | AAGTGCT | hsa-miR-520a-3p | AAAGTGCTTCCCTTTGGACTGT | 192 |
| hsa-miR-372 | AAGTGCT | hsa-miR-520b | AAAGTGCTTCCTTTTAGAGGG | 193 |
| hsa-miR-372 | AAGTGCT | hsa-miR-520c-3p | AAAGTGCTTCCTTTTAGAGGGT | 194 |
| hsa-miR-372 | AAGTGCT | hsa-miR-520e | AAAGTGCTTCCTTTTTGAGGG | 195 |
| hsa-miR-372 | AAGTGCT | hsa-miR-520e-3p | AAAGTGCTTCTCTTTGGTGGGT | 196 |
| hsa-miR-372 | AAGTGCT | hsa-miR-373 | GAAGTGCTTCGATTTTGGGGTGT | 197 |
| hsa-miR-372 | AAGTGCT | hsa-miR-302e | TAAGTGCTTCCATGCTT | 198 |
| hsa-miR-372 | AAGTGCT | hsa-miR-302c | TAAGTGCTTCCATGTTTCAGTGG | 199 |
| hsa-miR-372 | AAGTGCT | hsa-miR-302d | TAAGTGCTTCCATGTTTGAGTGT | 200 |
| hsa-miR-372 | AAGTGCT | hsa-miR-202b | TAAGTGCTTCCATGTTTTAGTAG | 201 |
| hsa-miR-372 | AAGTGCT | hsa-miR-302a | TAAGTGCTTCCATGTTTTGGTGA | 202 |
| hsa-miR-378 | CTGGACT | hsa-miR-422a | ACTGGACTTAGGGTCAGAAGGC | 203 |
| hsa-miR-497 | AGCAGCA | hsa-miR-424 | CAGCAGCAATTCATGTTTTGAA | 173 |
| hsa-miR-497 | AGCAGCA | hsa-miR-195 | TAGCAGCACAGAAATATTGGC | 174 |
| hsa-miR-497 | AGCAGCA | hsa-miR-15a | TAGCAGCACATAATGGTTTGTG | 175 |
| hsa-miR-497 | AGCAGCA | hsa-miR-15b | TAGCAGCACATCATGGTTTACA | 19 |
| hsa-miR-497 | AGCAGCA | hsa-miR-16 | TAGCAGCACGTAAATATTGGCG | 176 |
| hsa-miR-92b | ATTGCAC | hsa-miR-363 | AATTGCACGGTATCCATCTGTA | 184 |
| hsa-miR-92b | ATTGCAC | hsa-miR-367 | AATTGCACTTTAGCAATGGTGA | 185 |
| hsa-miR-92b | ATTGCAC | hsa-miR-25 | CATTGCACTTGTCTCGGTCTGA | 33 |
| hsa-miR-92b | ATTGCAC | hsa-miR-32 | TATTGCACATTACTAAGTTGCA | 186 |
| hsa-miR-92b | ATTGCAC | hsa-miR-92a | TATTGCACTTGTCCCGGCCTGT | 187 |
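The seed criterion used for Table 5 (identical nucleotides 2-8) can be expressed in a couple of lines. The function names are hypothetical; the sequences are taken from the table.

```python
def seed(sequence):
    """Nucleotides 2-8 (1-based) of a mature microRNA sequence."""
    return sequence[1:8]


def same_seed(seq_a, seq_b):
    """True when two microRNAs share an identical seed."""
    return seed(seq_a) == seed(seq_b)


let_7a = "TGAGGTAGTAGGTTGTATAGTT"      # hsa-let-7a, SEQ ID NO: 155
let_7d = "AGAGGTAGTAGGTTGCATAGTT"      # hsa-let-7d, SEQ ID NO: 152
mir_106a = "AAAAGTGCTTACAGTGCAGGTAG"   # hsa-miR-106a, SEQ ID NO: 3
mir_20a = "TAAAGTGCTTATAGTGCAGGTAG"    # hsa-miR-20a, SEQ ID NO: 28
```

Note that hsa-let-7a and hsa-let-7d differ at nucleotide 1 yet share the seed, since the first nucleotide is not part of it.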

For some of the microRNAs in Table 3, other microRNAs that are known in the human genome are located in close proximity on the genome (genomic cluster) (see Table 6), and are co-transcribed with the indicated miRs. These microRNAs from nearly the same genomic location may be substituted for the indicated miRs.

TABLE 6 microRNAs within the same genomic cluster

| Indicated miRs | miRs within the same genomic cluster | miR sequence | SEQ ID NO: |
|---|---|---|---|
| hsa-let-7b | hsa-let-7a | TGAGGTAGTAGGTTGTATAGTT | 155 |
| hsa-let-7b | hsa-let-7a* | CTATACAATCTACTGTCTTTC | 204 |
| hsa-let-7b | hsa-let-7b* | CTATACAACCTACTGCCTTCCC | 205 |
| hsa-let-7f | hsa-let-7a* | CTATACAATCTACTGTCTTTC | 204 |
| hsa-let-7f | hsa-let-7a | TGAGGTAGTAGGTTGTATAGTT | 155 |
| hsa-let-7f | hsa-let-7d | AGAGGTAGTAGGTTGCATAGTT | 152 |
| hsa-let-7f | hsa-let-7d* | CTATACGACCTGCTGCCTTTCT | 206 |
| hsa-let-7f | hsa-let-7f-1* | CTATACAATCTATTGCCTTCCC | 207 |
| hsa-let-7f | hsa-let-7f-2* | CTATACAGTCTACTGTCTTTCC | 208 |
| hsa-let-7f | hsa-miR-98 | TGAGGTAGTAAGTTGTATTGTT | 154 |
| hsa-miR-106a | hsa-miR-19b-2* | AGTTTTGCAGGTTTGCATTTCA | 209 |
| hsa-miR-106a | hsa-miR-20b | CAAAGTGCTCATAGTGCAGGTAG | 160 |
| hsa-miR-106a | hsa-miR-20b* | ACTGTAGTATGGGCACTTCCAG | 210 |
| hsa-miR-106a | hsa-miR-363 | AATTGCACGGTATCCATCTGTA | 184 |
| hsa-miR-106a | hsa-miR-363* | CGGGTGGATCACGATGCAATTT | 211 |
| hsa-miR-106a | hsa-miR-92a | TATTGCACTTGTCCCGGCCTGT | 187 |
| hsa-miR-106a | hsa-miR-92a-2* | GGGTGGGGATTTGTTGCATTAC | 212 |
| hsa-miR-106a | hsa-miR-106a* | CTGCAATGTAAGCACTTCTTAC | 213 |
| hsa-miR-106a | hsa-miR-18b | TAAGGTGCATCTAGTGCAGTTAG | 214 |
| hsa-miR-106a | hsa-miR-18b* | TGCCCTAAATGCCCCTTCTGGC | 215 |
| hsa-miR-106a | hsa-miR-19b | TGTGCAAATCCATGCAAAACTGA | 23 |
| hsa-miR-10a | hsa-miR-10a* | CAAATTCGTATCTAGGGGAATA | 216 |
| hsa-miR-10b | hsa-miR-10b* | ACAGATTCGATTCTAGGGGAAT | 217 |
| hsa-miR-122 | hsa-miR-122* | AACGCCATTATCACACTAAATA | 218 |
| hsa-miR-124 | hsa-miR-124* | CGTGTTCACAGCGGACCTTGAT | 219 |
| hsa-miR-125b | hsa-miR-125b-1* | ACGGGTTAGGCTCTTGGGAGCT | 220 |
| hsa-miR-125b | hsa-miR-125b-2* | TCACAAGTCAGGCTCTTGGGAC | 221 |
| hsa-miR-125b | hsa-miR-99a | AACCCGTAGATCCGATCTTGTG | 222 |
| hsa-miR-125b | hsa-miR-99a* | CAAGCTCGCTTCTATGGGTCTG | 223 |
| hsa-miR-125b | hsa-miR-100 | AACCCGTAGATCCGAACTTGTG | 224 |
| hsa-miR-125b | hsa-miR-100* | CAAGCTTGTATCTATAGGTATG | 225 |
| hsa-miR-125b | hsa-let-7a | TGAGGTAGTAGGTTGTATAGTT | 155 |
| hsa-miR-125b | hsa-let-7c | TGAGGTAGTAGGTTGTATGGTT | 156 |
| hsa-miR-125b | hsa-let-7c* | TAGAGTTACACCCTGGGAGTTA | 226 |
| hsa-miR-126 | hsa-miR-126* | CATTATTACTTTTGGTACGCG | 227 |
| hsa-miR-130a | hsa-miR-130a* | TTCACATTGTGCTACTGTCTGC | 228 |
| hsa-miR-138 | hsa-miR-138-1* | GCTACTTCACAACACCAGGGCC | 229 |
| hsa-miR-138 | hsa-miR-138-2* | GCTATTTCACGACACCAGGGTT | 230 |
| hsa-miR-142-3p | hsa-miR-142-5p | CATAAAGTAGAAAGCACTACT | 231 |
| hsa-miR-143 | hsa-miR-143* | GGTGCAGTGCTGCATCTCTGGT | 232 |
| hsa-miR-143 | hsa-miR-145 | GTCCAGTTTTCCCAGGAATCCCT | 233 |
| hsa-miR-143 | hsa-miR-145* | GGATTCCTGGAAATACTGTTCT | 234 |
| hsa-miR-146a | hsa-miR-146a* | CCTCTGAAATTCAGTTCTTCAG | 235 |
| hsa-miR-146b-5p | hsa-miR-146b-3p | TGCCCTGTGGACTCAGTTCTGG | 236 |
| hsa-miR-148b | hsa-miR-148b* | AAGTTCTGTTATACACTCAGGC | 237 |
| hsa-miR-15b | hsa-miR-15b* | CGAATCATTATTTGCTGCTCTA | 238 |
| hsa-miR-15b | hsa-miR-16 | TAGCAGCACGTAAATATTGGCG | 176 |
| hsa-miR-15b | hsa-miR-16-2* | CCAATATTACTGTGCTGCTTTA | 239 |
| hsa-miR-185 | hsa-miR-185* | AGGGGCTGGCTTTCCTCTGGTC | 240 |
| hsa-miR-185 | hsa-miR-1306 | ACGTTGGCTCTGGTGGTG | 241 |
| hsa-miR-192 | hsa-miR-192* | CTGCCAATTCCATAGGTCACAG | 242 |
| hsa-miR-192 | hsa-miR-194 | TGTAACAGCAACTCCATGTGGA | 243 |
| hsa-miR-192 | hsa-miR-194* | CCAGTGGGGCTGCTGTTATCTG | 244 |
| hsa-miR-193a-3p | hsa-miR-193a-5p | TGGGTCTTTGCGGGCGAGATGA | 245 |
| hsa-miR-193a-3p | hsa-miR-365 | TAATGCCCCTAAAAATCCTTAT | 246 |
| hsa-miR-19b | hsa-miR-19a | TGTGCAAATCTATGCAAAACTGA | 179 |
| hsa-miR-19b | hsa-miR-19a* | AGTTTTGCATAGTTGCACTACA | 247 |
| hsa-miR-19b | hsa-miR-18a | TAAGGTGCATCTAGTGCAGATAG | 248 |
| hsa-miR-19b | hsa-miR-18a* | ACTGCCCTAAGTGCTCCTTCTGG | 249 |
| hsa-miR-19b | hsa-miR-18b | TAAGGTGCATCTAGTGCAGTTAG | 214 |
| hsa-miR-19b | hsa-miR-18b* | TGCCCTAAATGCCCCTTCTGGC | 215 |
| hsa-miR-19b | hsa-miR-17 | CAAAGTGCTTACAGTGCAGGTAG | 162 |
| hsa-miR-19b | hsa-miR-17* | ACTGCAGTGAAGGCACTTGTAG | 250 |
| hsa-miR-19b | hsa-miR-106a | AAAAGTGCTTACAGTGCAGGTAG | 3 |
| hsa-miR-19b | hsa-miR-106a* | CTGCAATGTAAGCACTTCTTAC | 213 |
| hsa-miR-19b | hsa-miR-20a* | ACTGCATTATGAGCACTTAAAG | 251 |
| hsa-miR-19b | hsa-miR-20b | CAAAGTGCTCATAGTGCAGGTAG | 160 |
| hsa-miR-19b | hsa-miR-20b* | ACTGTAGTATGGGCACTTCCAG | 210 |
| hsa-miR-19b | hsa-miR-363 | AATTGCACGGTATCCATCTGTA | 184 |
| hsa-miR-19b | hsa-miR-363* | CGGGTGGATCACGATGCAATTT | 211 |
| hsa-miR-19b | hsa-miR-92a | TATTGCACTTGTCCCGGCCTGT | 187 |
| hsa-miR-19b | hsa-miR-92a-1* | AGGTTGGGATCGGTTGCAATGCT | 252 |
| hsa-miR-19b | hsa-miR-92a-2* | GGGTGGGGATTTGTTGCATTAC | 212 |
| hsa-miR-19b | hsa-miR-19b-1* | AGTTTTGCAGGTTTGCATCCAGC | 253 |
| hsa-miR-19b | hsa-miR-19b-2* | AGTTTTGCAGGTTTGCATTTCA | 209 |
| hsa-miR-19b | hsa-miR-20a | TAAAGTGCTTATAGTGCAGGTAG | 28 |
| hsa-miR-200a | hsa-miR-200b* | CATCTTACTGGGCAGCATTGGA | 254 |
| hsa-miR-200a | hsa-miR-429 | TAATACTGTCTGGTAAAACCGT | 181 |
| hsa-miR-200a | hsa-miR-200a* | CATCTTACCGGACAGTGCTGGA | 255 |
| hsa-miR-200a | hsa-miR-200b | TAATACTGCCTGGTAATGATGA | 25 |
| hsa-miR-200b | hsa-miR-200a | TAACACTGTCTGGTAACGATGTT | 24 |
| hsa-miR-200b | hsa-miR-200a* | CATCTTACCGGACAGTGCTGGA | 255 |
| hsa-miR-200b | hsa-miR-200b* | CATCTTACTGGGCAGCATTGGA | 254 |
| hsa-miR-200b | hsa-miR-429 | TAATACTGTCTGGTAAAACCGT | 181 |
| hsa-miR-200c | hsa-miR-200c* | CGTCTTACCCAGCAGTGTTTGG | 256 |
| hsa-miR-200c | hsa-miR-141 | TAACACTGTCTGGTAAAGATGG | 180 |
| hsa-miR-200c | hsa-miR-141* | CATCTTCCAGTACAGTGTTGGA | 257 |
| hsa-miR-20a | hsa-miR-17* | ACTGCAGTGAAGGCACTTGTAG | 250 |
| hsa-miR-20a | hsa-miR-17 | CAAAGTGCTTACAGTGCAGGTAG | 162 |
| hsa-miR-20a | hsa-miR-18a* | ACTGCCCTAAGTGCTCCTTCTGG | 249 |
| hsa-miR-20a | hsa-miR-18a | TAAGGTGCATCTAGTGCAGATAG | 248 |
| hsa-miR-20a | hsa-miR-19a* | AGTTTTGCATAGTTGCACTACA | 247 |
| hsa-miR-20a | hsa-miR-19a | TGTGCAAATCTATGCAAAACTGA | 179 |
| hsa-miR-20a | hsa-miR-20a* | ACTGCATTATGAGCACTTAAAG | 251 |
| hsa-miR-20a | hsa-miR-92a | TATTGCACTTGTCCCGGCCTGT | 187 |
| hsa-miR-20a | hsa-miR-92a-1* | AGGTTGGGATCGGTTGCAATGCT | 252 |
| hsa-miR-20a | hsa-miR-19b-1* | AGTTTTGCAGGTTTGCATCCAGC | 253 |
| hsa-miR-20a | hsa-miR-19b | TGTGCAAATCCATGCAAAACTGA | 23 |
| hsa-miR-21 | hsa-miR-21* | CAACACCAGTCGATGGGCTGT | 258 |
| hsa-miR-221 | hsa-miR-221* | ACCTGGCATACAATGTAGATTT | 259 |
| hsa-miR-221 | hsa-miR-222 | AGCTACATCTGGCTACTGGGT | 183 |
| hsa-miR-221 | hsa-miR-222* | CTCAGTAGCCAGTGTAGATCCT | 260 |
| hsa-miR-222 | hsa-miR-221* | ACCTGGCATACAATGTAGATTT | 259 |
| hsa-miR-222 | hsa-miR-222* | CTCAGTAGCCAGTGTAGATCCT | 260 |
| hsa-miR-222 | hsa-miR-221 | AGCTACATTGTCTGCTGGGTTTC | 31 |
| hsa-miR-25 | hsa-miR-25* | AGGCGGAGACTTGGGCAATTG | 261 |
| hsa-miR-25 | hsa-miR-93 | CAAAGTGCTGTTCGTGCAGGTAG | 161 |
| hsa-miR-25 | hsa-miR-93* | ACTGCTGAGCTAGCACTTCCCG | 262 |
| hsa-miR-25 | hsa-miR-106b | TAAAGTGCTGACAGTGCAGAT | 164 |
| hsa-miR-25 | hsa-miR-106b* | CCGCACTGTGGGTACTTGCTGC | 263 |
| hsa-miR-29a | hsa-miR-29a* | ACTGATTTCTTTTGGTGTTCAG | 264 |
| hsa-miR-29a | hsa-miR-29b | TAGCACCATTTGAAATCAGTGTT | 35 |
| hsa-miR-29a | hsa-miR-29b-1* | GCTGGTTTCATATGGTGGTTTAGA | 265 |
| hsa-miR-29b | hsa-miR-29a* | ACTGATTTCTTTTGGTGTTCAG | 264 |
| hsa-miR-29b | hsa-miR-29b-1* | GCTGGTTTCATATGGTGGTTTAGA | 265 |
| hsa-miR-29b | hsa-miR-29b-2* | CTGGTTTCACATGGTGGCTTAG | 266 |
| hsa-miR-29b | hsa-miR-29c | TAGCACCATTTGAAATCGGTTA | 36 |
| hsa-miR-29b | hsa-miR-29a | TAGCACCATCTGAAATCGGTTA | 34 |
| hsa-miR-29b | hsa-miR-29c* | TGACCGATTTCTCCTGGTGTTC | 267 |
| hsa-miR-29c | hsa-miR-29b-2* | CTGGTTTCACATGGTGGCTTAG | 266 |
| hsa-miR-29c | hsa-miR-29c* | TGACCGATTTCTCCTGGTGTTC | 267 |
| hsa-miR-29c | hsa-miR-29b | TAGCACCATTTGAAATCAGTGTT | 35 |
| hsa-miR-30a | hsa-miR-30a* | CTTTCAGTCGGATGTTTGCAGC | 268 |
| hsa-miR-30a | hsa-miR-30c | TGTAAACATCCTACACTCTCAGC | 190 |
| hsa-miR-30a | hsa-miR-30c-2* | CTGGGAGAAGGCTGTTTACTCT | 269 |
| hsa-miR-31 | hsa-miR-31* | TGCTATGCCAACATATTGCCAT | 270 |
| hsa-miR-342-3p | hsa-miR-342-5p | AGGGGTGCTATCTGTGATTGA | 271 |
| hsa-miR-372 | hsa-miR-371-3p | AAGTGCCGCCATCTTTTGAGTGT | 272 |
| hsa-miR-372 | hsa-miR-371-5p | ACTCAAACTGTGGGGGCACT | 273 |
| hsa-miR-372 | hsa-miR-373 | GAAGTGCTTCGATTTTGGGGTGT | 197 |
| hsa-miR-372 | hsa-miR-373* | ACTCAAAATGGGGGCGCTTTCC | 274 |
| hsa-miR-378 | hsa-miR-378* | CTCCTGACTCCAGGTCCTGTGT | 275 |
| hsa-miR-425 | hsa-miR-425* | ATCGGGAATGTCGTGTCCGCCC | 276 |
| hsa-miR-425 | hsa-miR-191 | CAACGGAATCCCAAAAGCAGCTG | 277 |
| hsa-miR-425 | hsa-miR-191* | GCTGCGCTTGGATTTCGTCCCC | 278 |
| hsa-miR-451 | hsa-miR-144 | TACAGTATAGATGATGTACT | 279 |
| hsa-miR-451 | hsa-miR-144* | GGATATCATCATATACTGTAAG | 280 |
| hsa-miR-497 | hsa-miR-195 | TAGCAGCACAGAAATATTGGC | 174 |
| hsa-miR-497 | hsa-miR-195* | CCAATATTGGCTGTGCTGCTCC | 281 |
| hsa-miR-497 | hsa-miR-497* | CAAACCACACTGTGGTGTTAGA | 282 |
| hsa-miR-9* | hsa-miR-9 | TCTTTGGTTATCTAGCTGTATGA | 283 |
| hsa-miR-92b | hsa-miR-92b* | AGGGACGGGACGCGGTGCAGTG | 284 |

For some of the microRNAs in Table 3, other microRNAs that are known in the human genome have similar sequences (fewer than 6 mismatches in the sequence) (see Table 7), and may therefore also be captured by probes with the same design. These microRNAs with similar overall sequence may be substituted for the indicated miRs.

TABLE 7
microRNAs with similar sequence

Indicated miRs   miRs in sequence cluster  Sequence                  SEQ ID NO:
hsa-let-7b       hsa-let-7a                TGAGGTAGTAGGTTGTATAGTT    155
hsa-let-7b       hsa-let-7e                TGAGGTAGGAGGTTGTATAGTT    153
hsa-let-7b       hsa-let-7c                TGAGGTAGTAGGTTGTATGGTT    156
hsa-let-7b       hsa-let-7f                TGAGGTAGTAGATTGTATAGTT    2
hsa-let-7b       hsa-let-7d                AGAGGTAGTAGGTTGCATAGTT    152
hsa-let-7b       hsa-miR-1827              TGAGGCAGTAGATTGAAT        285
hsa-let-7b       hsa-let-7g                TGAGGTAGTAGTTTGTACAGTT    157
hsa-let-7b       hsa-miR-98                TGAGGTAGTAAGTTGTATTGTT    154
hsa-let-7f       hsa-let-7b                TGAGGTAGTAGGTTGTGTGGTT    1
hsa-let-7f       hsa-let-7c                TGAGGTAGTAGGTTGTATGGTT    156
hsa-let-7f       hsa-miR-1827              TGAGGCAGTAGATTGAAT        285
hsa-let-7f       hsa-let-7g                TGAGGTAGTAGTTTGTACAGTT    157
hsa-let-7f       hsa-miR-98                TGAGGTAGTAAGTTGTATTGTT    154
hsa-let-7f       hsa-let-7d                AGAGGTAGTAGGTTGCATAGTT    152
hsa-let-7f       hsa-let-7e                TGAGGTAGGAGGTTGTATAGTT    153
hsa-let-7f       hsa-let-7a                TGAGGTAGTAGGTTGTATAGTT    155
hsa-miR-106a     hsa-miR-17                CAAAGTGCTTACAGTGCAGGTAG   162
hsa-miR-106a     hsa-miR-93                CAAAGTGCTGTTCGTGCAGGTAG   161
hsa-miR-106a     hsa-miR-106b              TAAAGTGCTGACAGTGCAGAT     164
hsa-miR-106a     hsa-miR-20b               CAAAGTGCTCATAGTGCAGGTAG   160
hsa-miR-106a     hsa-miR-20a               TAAAGTGCTTATAGTGCAGGTAG   28
hsa-miR-10a      hsa-miR-10b               TACCCTGTAGAACCGAATTTGTG   165
hsa-miR-10b      hsa-miR-10a               TACCCTGTAGATCCGAATTTGTG   4
hsa-miR-130a     hsa-miR-130b              CAGTGCAATGATGAAAGGGCAT    170
hsa-miR-146a     hsa-miR-146b-5p           TGAGAACTGAATTCCATAGGCT    16
hsa-miR-146b-5p  hsa-miR-146a              TGAGAACTGAATTCCATGGGTT    15
hsa-miR-148b     hsa-miR-148a              TCAGTGCACTACAGAACTTTGT    172
hsa-miR-148b     hsa-miR-152               TCAGTGCATGACAGAACTTGG     18
hsa-miR-152      hsa-miR-148b              TCAGTGCATCACAGAACTTTGT    17
hsa-miR-152      hsa-miR-148a              TCAGTGCACTACAGAACTTTGT    172
hsa-miR-15b      hsa-miR-15a               TAGCAGCACATAATGGTTTGTG    175
hsa-miR-192      hsa-miR-215               ATGACCTATGAATTGACAGAC     177
hsa-miR-193a-3p  hsa-miR-193b              AACTGGCCCTCAAAGTCCCGCT    178
hsa-miR-19b      hsa-miR-19a               TGTGCAAATCTATGCAAAACTGA   179
hsa-miR-200a     hsa-miR-141               TAACACTGTCTGGTAAAGATGG    180
hsa-miR-200b     hsa-miR-200c              TAATACTGCCGGGTAATGATGGA   26
hsa-miR-200c     hsa-miR-200b              TAATACTGCCTGGTAATGATGA    25
hsa-miR-20a      hsa-miR-106b              TAAAGTGCTGACAGTGCAGAT     164
hsa-miR-20a      hsa-miR-20b               CAAAGTGCTCATAGTGCAGGTAG   160
hsa-miR-20a      hsa-miR-93                CAAAGTGCTGTTCGTGCAGGTAG   161
hsa-miR-20a      hsa-miR-17                CAAAGTGCTTACAGTGCAGGTAG   162
hsa-miR-20a      hsa-miR-106a              AAAAGTGCTTACAGTGCAGGTAG   3
hsa-miR-29a      hsa-miR-29c               TAGCACCATTTGAAATCGGTTA    36
hsa-miR-29a      hsa-miR-29b               TAGCACCATTTGAAATCAGTGTT   35
hsa-miR-29b      hsa-miR-29a               TAGCACCATCTGAAATCGGTTA    34
hsa-miR-29b      hsa-miR-29c               TAGCACCATTTGAAATCGGTTA    36
hsa-miR-29c      hsa-miR-29b               TAGCACCATTTGAAATCAGTGTT   35
hsa-miR-29c      hsa-miR-29a               TAGCACCATCTGAAATCGGTTA    34
hsa-miR-30a      hsa-miR-30d               TGTAAACATCCCCGACTGGAAG    188
hsa-miR-30a      hsa-miR-30e               TGTAAACATCCTTGACTGGAAG    191
hsa-miR-378      hsa-miR-422a              ACTGGACTTAGGGTCAGAAGGC    203
hsa-miR-92b      hsa-miR-92a               TATTGCACTTGTCCCGGCCTGT    187
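
The similarity criterion stated before Table 7 (fewer than 6 mismatches in the sequence) can be illustrated with a minimal Python sketch. The source does not define how mismatches are counted, so the sketch assumes a position-wise comparison with both sequences aligned at the 5' end, counted over the shared length only; the function names `count_mismatches` and `is_similar` are hypothetical and are not part of the described method.

```python
def count_mismatches(seq_a: str, seq_b: str) -> int:
    """Count position-wise mismatches over the shared length.

    Assumption: both sequences are aligned at the 5' end and bases
    beyond the end of the shorter sequence are ignored. The source
    does not say how length differences or gaps are treated.
    """
    a, b = seq_a.upper(), seq_b.upper()
    # zip() stops at the shorter sequence, so only overlapping
    # positions are compared.
    return sum(1 for x, y in zip(a, b) if x != y)


def is_similar(seq_a: str, seq_b: str, max_mismatches: int = 5) -> bool:
    """True when the pair has fewer than 6 mismatches (at most 5)."""
    return count_mismatches(seq_a, seq_b) <= max_mismatches


# Sequences as listed in Table 7 (written as DNA, 5' to 3').
LET_7B = "TGAGGTAGTAGGTTGTGTGGTT"  # SEQ ID NO: 1
LET_7A = "TGAGGTAGTAGGTTGTATAGTT"  # SEQ ID NO: 155

if __name__ == "__main__":
    print(count_mismatches(LET_7B, LET_7A))  # prints 2
    print(is_similar(LET_7B, LET_7A))        # prints True
```

Under these assumptions hsa-let-7b and hsa-let-7a differ at two positions, which is consistent with their listing as a similar pair in Table 7. A different counting convention (for example, a gapped alignment) could give different counts for pairs of unequal length.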

The foregoing description of the specific embodiments so fully reveals the general nature of the invention that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without undue experimentation and without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.

It should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.

REFERENCES

-   1. Bentwich, I. et al. Identification of hundreds of conserved and nonconserved human microRNAs. Nat Genet (2005).
-   2. Farh, K. K. et al. The Widespread Impact of Mammalian MicroRNAs on mRNA Repression and Evolution. Science (2005).
-   3. Griffiths-Jones, S., Grocock, R. J., van Dongen, S., Bateman, A. & Enright, A. J. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34, D140-4 (2006).
-   4. He, L. et al. A microRNA polycistron as a potential human oncogene. Nature 435, 828-33 (2005).
-   5. Baskerville, S. & Bartel, D. P. Microarray profiling of microRNAs reveals frequent coexpression with neighboring miRNAs and host genes. RNA 11, 241-7 (2005).
-   6. Landgraf, P. et al. A Mammalian microRNA Expression Atlas Based on Small RNA Library Sequencing. Cell 129, 1401-14 (2007).
-   7. Volinia, S. et al. A microRNA expression signature of human solid tumors defines cancer gene targets. Proc Natl Acad Sci USA (2006).
-   8. Lu, J. et al. MicroRNA expression profiles classify human cancers. Nature 435, 834-8 (2005).
-   9. Varadhachary, G. R., Abbruzzese, J. L. & Lenzi, R. Diagnostic strategies for unknown primary cancer. Cancer 100, 1776-85 (2004).
-   10. Pimiento, J. M., Teso, D., Malkan, A., Dudrick, S. J. & Palesty, J. A. Cancer of unknown primary origin: a decade of experience in a community-based hospital. Am J Surg 194, 833-7; discussion 837-8 (2007).
-   11. Shaw, P. H., Adams, R., Jordan, C. & Crosby, T. D. A clinical review of the investigation and management of carcinoma of unknown primary in a single cancer network. Clin Oncol (R Coll Radiol) 19, 87-95 (2007).
-   12. Hainsworth, J. D. & Greco, F. A. Treatment of patients with cancer of an unknown primary site. N Engl J Med 329, 257-63 (1993).
-   13. Blaszyk, H., Hartmann, A. & Bjornsson, J. Cancer of unknown primary: clinicopathologic correlations. APMIS 111, 1089-94 (2003).
-   14. Bloom, G. et al. Multi-platform, multi-site, microarray-based human tumor classification. Am J Pathol 164, 9-16 (2004).
-   15. Ma, X. J. et al. Molecular classification of human cancers using a 92-gene real-time quantitative polymerase chain reaction assay. Arch Pathol Lab Med 130, 465-73 (2006).
-   16. Talantov, D. et al. A quantitative reverse transcriptase-polymerase chain reaction assay to identify metastatic carcinoma tissue of origin. J Mol Diagn 8, 320-9 (2006).
-   17. Tothill, R. W. et al. An expression-based site of origin diagnostic method designed for clinical application to cancer of unknown origin. Cancer Res 65, 4031-40 (2005).
-   18. Shedden, K. A. et al. Accurate molecular classification of human cancers based on gene expression using a simple classifier with a pathological tree-based framework. Am J Pathol 163, 1985-95 (2003).
-   19. Raver-Shapira, N. et al. Transcriptional Activation of miR-34a Contributes to p53-Mediated Apoptosis. Mol Cell (2007).
-   20. Xiao, C. et al. MiR-150 Controls B Cell Differentiation by Targeting the Transcription Factor c-Myb. Cell 131, 146-59 (2007).

The contents of U.S. patent application Ser. No. 14/320,113, filed Jun. 30, 2014; U.S. patent application Ser. No. 13/167,489, filed Jun. 23, 2011; International Application No. PCT/IL09/01212, filed Dec. 23, 2009; U.S. Provisional Application No. 61/140,642, filed Dec. 24, 2008; U.S. patent application Ser. No. 12/532,940, filed Sep. 24, 2009; International Application No. PCT/IL08/00396, filed Mar. 20, 2008; U.S. Provisional Application No. 60/907,266, filed Mar. 27, 2007; U.S. Provisional Application No. 60/929,244, filed Jun. 19, 2007; and U.S. Provisional Application No. 61/024,565, filed Jan. 30, 2008, are herein incorporated by reference in their entirety for all purposes.

1. A method of producing thyroid medullary cancer cDNA sequences, the method comprising: isolating RNA from a sample from a human subject, wherein the sample comprises a tumor cell; contacting the RNA with a polyadenylation agent under conditions that are sufficient to form a polyadenylated RNA; reverse transcribing the polyadenylated RNA in the presence of a universal poly(T) adapter to produce cDNA sequences comprising a poly(T) tail; and amplifying the cDNA sequences using a forward primer that is specific for SEQ ID NO:42; thereby producing thyroid medullary cancer cDNA sequences.
 2. The method of claim 1, wherein the amplifying further comprises amplifying the cDNA sequences with a reverse primer that is complementary to the poly(T) tail.
 3. The method of claim 1, wherein the sample is a biopsy.
 4. The method of claim 1, wherein the sample is a fine-needle aspiration.
 5. The method of claim 1, wherein the amplifying comprises quantitative PCR.
 6. A reaction mixture for generating cDNA sequences derived from a thyroid medullary cancer, the reaction mixture comprising: a nucleic acid sample obtained from a biological sample from a human subject, wherein the biological sample comprises a tumor cell; a primer for generating cDNA sequences, wherein the primer is specific for SEQ ID NO:42; and a detectable probe.
 7. The reaction mixture of claim 6, wherein the nucleic acid sample comprises cDNA sequences comprising a poly(T) tail, wherein the cDNA sequences are produced from an RNA sample that has been polyadenylated and reverse transcribed in the presence of a universal poly(T) adapter.
 8. The reaction mixture of claim 7, wherein the reaction mixture further comprises a reverse primer that is complementary to the poly(T) tail.
 9. The reaction mixture of claim 6, wherein the biological sample is a biopsy or a fine-needle aspiration.
 10. A method of detecting gene expression in a sample from a human subject, the method comprising: detecting the presence of SEQ ID NO:42 in a nucleic acid sample from the subject, wherein the detecting step comprises: contacting the nucleic acid sample with a primer specific for SEQ ID NO:42; amplifying at least one nucleic acid sequence in the sample; and detecting the presence of an amplified nucleic acid sequence comprising SEQ ID NO:42; thereby detecting the gene expression in the sample.
 11. The method of claim 10, wherein the nucleic acid sample is an RNA sample isolated from a biological sample from the subject, wherein the biological sample comprises a tumor cell.
 12. The method of claim 11, wherein the biological sample is a biopsy.
 13. The method of claim 11, wherein the biological sample is a fine-needle aspiration.
 14. The method of claim 11, wherein prior to the contacting step, the method comprises contacting the RNA sample with a polyadenylation agent under conditions that are sufficient to form a polyadenylated RNA and reverse transcribing the polyadenylated RNA in the presence of a universal poly(T) adapter to produce a nucleic acid sample comprising cDNA sequences.
 15. The method of claim 10, wherein the amplifying step comprises quantitative PCR.
 16. A method of identifying a subject as having a cancer of thyroid medullary origin, the method comprising: obtaining a nucleic acid sample from a human subject, wherein the nucleic acid sample is from a biological sample comprising a tumor cell; contacting the nucleic acid sample with a primer specific for SEQ ID NO:42; amplifying the nucleic acid sequences in the biological sample; and detecting the presence of an amplified nucleic acid sequence comprising SEQ ID NO:42; thereby identifying the subject as having a cancer of thyroid medullary origin.
 17. The method of claim 16, wherein the nucleic acid sample is an RNA sample, and wherein prior to the contacting step, the method comprises contacting the RNA sample with a polyadenylation agent under conditions that are sufficient to form a polyadenylated RNA and reverse transcribing the polyadenylated RNA in the presence of a universal poly(T) adapter to produce a nucleic acid sample comprising cDNA sequences.
 18. The method of claim 16, wherein the biological sample is a biopsy.
 19. The method of claim 16, wherein the biological sample is a fine-needle aspiration.
 20. The method of claim 16, wherein the amplifying step comprises quantitative PCR.
 21. The method of claim 16, wherein the detecting step comprises measuring the relative abundance of an amplified nucleic acid sequence comprising SEQ ID NO:42, relative to a reference value, and identifying the subject as having a cancer of thyroid medullary origin based on the relative abundance of the amplified nucleic acid sequence comprising SEQ ID NO:42. 
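
By way of illustration only, the relative-abundance measurement recited in claim 21 could be sketched as follows when the amplifying step is quantitative PCR. The claims do not specify how relative abundance is computed. The sketch assumes the commonly used delta-Ct calculation with ideal amplification efficiency (a doubling per cycle); the normalizer Ct, the reference value, and the function names are hypothetical. Whether a higher or lower value supports the identification is not stated in the claims and is not assumed here.

```python
def relative_abundance(ct_target: float, ct_normalizer: float) -> float:
    """Relative abundance by the delta-Ct approach.

    Assumptions (not taken from the claims): ct_target is the qPCR
    threshold cycle of the amplicon of interest, ct_normalizer is the
    threshold cycle of a hypothetical normalizer measured in the same
    sample, and amplification doubles the product every cycle.
    """
    delta_ct = ct_target - ct_normalizer
    return 2.0 ** (-delta_ct)


def fold_vs_reference(abundance: float, reference_value: float) -> float:
    """Express a relative abundance as a fold difference from a
    reference value (hypothetical; must be positive)."""
    if reference_value <= 0:
        raise ValueError("reference_value must be positive")
    return abundance / reference_value


if __name__ == "__main__":
    # A target detected 3 cycles earlier than the normalizer.
    abundance = relative_abundance(ct_target=20.0, ct_normalizer=23.0)
    print(abundance)                          # prints 8.0
    print(fold_vs_reference(abundance, 2.0))  # prints 4.0
```

A lower Ct means the amplicon crossed the detection threshold earlier, so a target Ct three cycles below the normalizer corresponds to an eightfold higher relative abundance under the doubling assumption.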